InvalidOperationException when parsing HTML stream

PhilBoyd · October 29, 2024, 6:35pm

I'm getting an InvalidOperationException (“Nullable object must have a value”) when using the following code:

using Aspose.Pdf;

var httpClient = new HttpClient();
var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2022-12-05/AsOf/TOT5/C1TOC.html";
var cancellationToken = new CancellationToken();
var request = new HttpRequestMessage(HttpMethod.Get, url);

using var result = httpClient.Send(request, HttpCompletionOption.ResponseContentRead, cancellationToken);
result.EnsureSuccessStatusCode();

using var stream = result.Content.ReadAsStream(cancellationToken);
stream.Position = 0;

var memoryStream = new MemoryStream();
stream.CopyTo(memoryStream);
memoryStream.Position = 0;

(new { Length = memoryStream.Length, Position = memoryStream.Position }).Dump();

var options = new HtmlLoadOptions(url)
{
    HtmlMediaType = HtmlMediaType.Print,
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth,
    PageInfo = new PageInfo { Width = PageSize.PageLetter.Width, Height = PageSize.PageLetter.Height, IsLandscape = false },
};

using var pdf = new Document(memoryStream, options);
var docInfo = new DocumentInfo(pdf)
{
    Title = "title"
};

Professionalize.Discourse · October 29, 2024, 6:35pm

@PhilBoyd

The InvalidOperationException you’re encountering, which states “Nullable object must have a value,” may be related to how the HTML content is being fetched and processed. It seems that the issue arises from the way the HttpClient is being used to retrieve the HTML stream.

One potential solution is to ensure that the HTTP request is fully completed before attempting to read the stream. Your code uses the synchronous Send and ReadAsStream calls; in some cases, blocking calls like these (or GetAwaiter().GetResult() on an async call) can lead to issues where the operation does not complete as expected. Instead, consider using the asynchronous methods with await.

Here’s a modified version of your code that uses await to ensure the request completes before processing the stream:

var httpClient = new HttpClient();
var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2022-12-05/AsOf/TOT5/C1TOC.html";
var cancellationToken = new CancellationToken();
var request = new HttpRequestMessage(HttpMethod.Get, url);

using var result = await httpClient.SendAsync(request, HttpCompletionOption.ResponseContentRead, cancellationToken);
result.EnsureSuccessStatusCode();

using var stream = await result.Content.ReadAsStreamAsync(cancellationToken);
stream.Position = 0;

var memoryStream = new MemoryStream();
await stream.CopyToAsync(memoryStream, cancellationToken);
memoryStream.Position = 0;

var options = new HtmlLoadOptions(url)
{
    HtmlMediaType = HtmlMediaType.Print,
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth,
    PageInfo = new PageInfo { Width = PageSize.PageLetter.Width, Height = PageSize.PageLetter.Height, IsLandscape = false },
};

using var pdf = new Document(memoryStream, options);
var docInfo = new DocumentInfo(pdf)
{
    Title = "title"
};

In this code, I’ve replaced the synchronous calls with their asynchronous counterparts, which should help ensure that the stream is fully loaded before you attempt to read from it. This change may resolve the InvalidOperationException you are experiencing.

If the problem persists, please check if the HTML content being fetched is valid and complete, as any issues with the content itself could also lead to exceptions during processing.
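
One quick way to act on that last suggestion is to sanity-check the downloaded bytes before handing them to the converter. The sketch below is only an illustration: `LooksLikeCompleteHtml` is a hypothetical helper, not an Aspose or HttpClient API, and it assumes a UTF-8 compatible encoding and that a complete page contains a closing `</html>` tag.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of Aspose or HttpClient): a cheap sanity check
// that the downloaded bytes are non-empty and contain a closing </html> tag.
static bool LooksLikeCompleteHtml(byte[] content)
{
    if (content is null || content.Length == 0)
    {
        return false;
    }

    // Assumption: the page uses a UTF-8 compatible encoding; the tag itself is plain ASCII.
    var text = Encoding.UTF8.GetString(content);
    return text.Contains("</html>", StringComparison.OrdinalIgnoreCase);
}

// Stand-in for the downloaded page; in the code above this would be memoryStream.ToArray().
var sample = Encoding.UTF8.GetBytes("<html><body><p>ok</p></body></html>");
Console.WriteLine(LooksLikeCompleteHtml(sample));
```

A passing check does not prove the markup is valid; it only rules out an empty or obviously truncated download before the Aspose call.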

PhilBoyd · October 29, 2024, 6:39pm

Thanks - but I’m not doing any async operations. And this code works for another page.

asad.ali · October 29, 2024, 10:50pm

@PhilBoyd

Can you please confirm if you have tried with the latest version of the API? Also, please share the complete stack trace information with us so that we can further proceed accordingly.
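
For reference, one way to collect both pieces of information requested here (the library version and the complete stack trace) is sketched below. `BuildErrorReport` is a hypothetical helper, and `typeof(string)` plus the nullable access are placeholders so the snippet runs without Aspose installed; in the real project the type would presumably be `Aspose.Pdf.Document` and the `try` body would be the failing `new Document(memoryStream, options)` call.

```csharp
using System;

// Hypothetical helper: combines the assembly name and version of a library type
// with the full exception text. Exception.ToString() includes the message,
// the stack trace, and any inner exceptions.
static string BuildErrorReport(Type libraryType, Exception ex)
{
    var assemblyName = libraryType.Assembly.GetName();
    return $"{assemblyName.Name} {assemblyName.Version}{Environment.NewLine}{ex}";
}

try
{
    // Stand-in for the failing call; reading Value on an empty nullable throws
    // the same InvalidOperationException reported in this thread.
    int? missing = null;
    _ = missing.Value;
}
catch (InvalidOperationException ex)
{
    // Placeholder type: pass the Aspose document type here in the real project (assumption).
    Console.WriteLine(BuildErrorReport(typeof(string), ex));
}
```

Pasting the printed text into a reply carries more detail than a screenshot of the message alone.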

PhilBoyd · October 30, 2024, 1:02pm

Going to do that this morning - my client currently only has V22 in house.

PhilBoyd · October 30, 2024, 3:28pm

This causes a stack overflow in V24:

using Aspose.Html;
using Aspose.Html.Converters;
using Aspose.Html.Saving;
using PdfTemp;

var httpClient = new HttpClient();
var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2022-12-05/AsOf/TOT5/C1TOC.html";
//var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2024-10-24/AsOf/TOT5/FOREWORD.html";
var cancellationToken = new CancellationToken();
var request = new HttpRequestMessage(HttpMethod.Get, url);

using var result = httpClient.Send(request, HttpCompletionOption.ResponseContentRead, cancellationToken);
result.EnsureSuccessStatusCode();

using var stream = result.Content.ReadAsStream(cancellationToken);
stream.Seek(0, SeekOrigin.Begin);

using var streamProvider = new MemoryStreamProvider();
using var doc = new HTMLDocument(stream, url);
Converter.ConvertHTML(doc, new PdfSaveOptions(), streamProvider);

var pdfStream = streamProvider.Streams.First();
using (var fileStream = File.Create("file.pdf"))
{
    pdfStream.Seek(0, SeekOrigin.Begin);
    pdfStream.CopyTo(fileStream);
}

asad.ali · October 30, 2024, 8:00pm

@PhilBoyd

Can you please share the screenshot of the error and stack trace information as well? We will further proceed with this information accordingly.

PhilBoyd · November 1, 2024, 3:22pm

So V24 fixed that issue, but the PDF formatting is horrendous. Adding the base href to the html does not fix the formatting issue. And adding the base href to the HtmlLoadOptions constructor causes a stack overflow.

url: "https://manuals.health.mil/pages/DisplayManualHtmlFile/2024-10-31/AsOf/TOT5/FOREWORD.html"

var baseHref = "https://manuals.health.mil";
var options = new HtmlLoadOptions(baseHref)
{
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth,
    PageInfo = new PageInfo { Width = PageSize.PageLetter.Width, Height = PageSize.PageLetter.Height, IsLandscape = false },
};
using var pdf = new Document(htmlStream, options);  // <--- booom! (stack overflow)
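
One thing that may be worth checking for the formatting problem is which base the page's relative references actually need, because the site root and the full page URL resolve path-relative references differently. The sketch below uses only `System.Uri`, not Aspose; `Resolve` is a hypothetical helper, `styles/site.css` is a made-up reference rather than one taken from the page, and whether Aspose.PDF resolves its base path by the same rules is an assumption.

```csharp
using System;

// Hypothetical helper: resolves a relative reference against a base URL
// using the standard URI resolution implemented by System.Uri.
static string Resolve(string baseUrl, string relativeReference)
{
    return new Uri(new Uri(baseUrl), relativeReference).AbsoluteUri;
}

// "styles/site.css" is a made-up relative reference for illustration only.
var siteRoot = "https://manuals.health.mil";
var pageUrl = "https://manuals.health.mil/pages/DisplayManualHtmlFile/2024-10-31/AsOf/TOT5/FOREWORD.html";

// Path-relative reference: the result depends on the directory of the base URL.
Console.WriteLine(Resolve(siteRoot, "styles/site.css"));
Console.WriteLine(Resolve(pageUrl, "styles/site.css"));

// Root-relative reference: both bases give the same result.
Console.WriteLine(Resolve(pageUrl, "/styles/site.css"));
```

If the page's stylesheets are referenced path-relatively, the site root alone would point them at the wrong location, which could explain missing styling in the output.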

asad.ali · November 1, 2024, 9:48pm

@PhilBoyd

Would you please confirm if you are using Aspose.PDF only for now to convert HTML to PDF? Also, can you please share the generated output PDF for our reference?

PhilBoyd · November 4, 2024, 3:15pm

Tried to upload the zip file - getting an error.

PhilBoyd · November 4, 2024, 3:16pm

Can you shoot me an email at pboyd9@humanamility.com? I can reply with the zip file.

asad.ali · November 4, 2024, 9:24pm

@PhilBoyd

We have sent you a private message where you can share the link of the .zip file by uploading it to Google Drive or Dropbox.