InvalidOperationException when parsing HTML stream

I'm getting an InvalidOperationException: “Nullable object must have a value”.

I'm using the following code:

var httpClient = new HttpClient();
var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2022-12-05/AsOf/TOT5/C1TOC.html";
var cancellationToken = new CancellationToken();
var request = new HttpRequestMessage(HttpMethod.Get, url);

using var result = httpClient.Send(request, HttpCompletionOption.ResponseContentRead, cancellationToken);
result.EnsureSuccessStatusCode();

using var stream = result.Content.ReadAsStream(cancellationToken);
stream.Position = 0;

var memoryStream = new MemoryStream();
stream.CopyTo(memoryStream);
memoryStream.Position = 0;

// Dump() is a LINQPad extension method; outside LINQPad, use Console.WriteLine instead.
(new { Length = memoryStream.Length, Position = memoryStream.Position }).Dump();

var options = new HtmlLoadOptions(url)
{
    HtmlMediaType = HtmlMediaType.Print,
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth,
    PageInfo = new PageInfo { Width = PageSize.PageLetter.Width, Height = PageSize.PageLetter.Height, IsLandscape = false },
};

using var pdf = new Document(memoryStream, options);
var docInfo = new DocumentInfo(pdf)
{
    Title = "title"
};

@PhilBoyd

The InvalidOperationException you’re encountering, which states “Nullable object must have a value,” may be related to how the HTML content is being fetched and processed. It seems that the issue arises from the way the HttpClient is being used to retrieve the HTML stream.

One potential solution is to ensure that the HTTP request is fully completed before attempting to read the stream. In some cases, blocking synchronous calls (such as Send and ReadAsStream, or GetAwaiter().GetResult() on an asynchronous call) can lead to issues where the operation does not complete as expected. Instead, consider using the asynchronous methods with await.

Here’s a modified version of your code that uses await to ensure the request completes before processing the stream:

var httpClient = new HttpClient();
var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2022-12-05/AsOf/TOT5/C1TOC.html";
var cancellationToken = new CancellationToken();
var request = new HttpRequestMessage(HttpMethod.Get, url);

using var result = await httpClient.SendAsync(request, HttpCompletionOption.ResponseContentRead, cancellationToken);
result.EnsureSuccessStatusCode();

using var stream = await result.Content.ReadAsStreamAsync(cancellationToken);
stream.Position = 0;

var memoryStream = new MemoryStream();
await stream.CopyToAsync(memoryStream, cancellationToken);
memoryStream.Position = 0;

var options = new HtmlLoadOptions(url)
{
    HtmlMediaType = HtmlMediaType.Print,
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth,
    PageInfo = new PageInfo { Width = PageSize.PageLetter.Width, Height = PageSize.PageLetter.Height, IsLandscape = false },
};

using var pdf = new Document(memoryStream, options);
var docInfo = new DocumentInfo(pdf)
{
    Title = "title"
};

In this code, I’ve replaced the synchronous calls with their asynchronous counterparts, which should help ensure that the stream is fully loaded before you attempt to read from it. This change may resolve the InvalidOperationException you are experiencing.

If the problem persists, please check if the HTML content being fetched is valid and complete, as any issues with the content itself could also lead to exceptions during processing.
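
As a rough way to run that check, the sketch below decodes the buffered response and tests whether it starts and ends like an HTML page. LooksLikeCompleteHtml is a hypothetical helper, not an Aspose API, and the check is only a heuristic: it assumes UTF-8 content, and valid HTML may legally omit the closing tag.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Aspose): heuristic check that a buffered,
// seekable stream holds what looks like a complete HTML page.
static bool LooksLikeCompleteHtml(Stream content)
{
    content.Position = 0;
    // leaveOpen: true so disposing the reader does not close the caller's stream.
    using var reader = new StreamReader(content, Encoding.UTF8,
        detectEncodingFromByteOrderMarks: true, bufferSize: 4096, leaveOpen: true);
    string html = reader.ReadToEnd();
    content.Position = 0; // rewind so the stream can still be passed to the converter

    return html.Contains("<html", StringComparison.OrdinalIgnoreCase)
        && html.TrimEnd().EndsWith("</html>", StringComparison.OrdinalIgnoreCase);
}

// Usage with an in-memory sample; in the code above, pass memoryStream instead.
var sample = new MemoryStream(Encoding.UTF8.GetBytes("<html><body>ok</body></html>"));
Console.WriteLine(LooksLikeCompleteHtml(sample));
```

If the check fails for one page and passes for another, saving both responses to disk and comparing them is a reasonable next step.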

Thanks - but I’m not doing any async operations. And this code works for another page.

@PhilBoyd

Can you please confirm if you have tried with the latest version of the API? Also, please share the complete stack trace information with us so that we can further proceed accordingly.
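
One way to collect the complete stack trace is to wrap the failing call and print the result of Exception.ToString(), which includes the exception type, message, inner exceptions, and stack trace. The sketch below is only an illustration: CaptureFailure is a hypothetical helper, and the nullable access stands in for the real `new Document(memoryStream, options)` call.

```csharp
using System;

// Hypothetical wrapper: runs an action and returns the full exception text,
// or null if the action completed without throwing.
static string? CaptureFailure(Action action)
{
    try
    {
        action();
        return null;
    }
    catch (Exception ex)
    {
        // ToString() includes type, message, inner exceptions and stack trace.
        return ex.ToString();
    }
}

// Stand-in for the failing call: reading Value from an empty nullable throws
// the same InvalidOperationException type that is reported in this thread.
int? missing = null;
string? report = CaptureFailure(() => Console.WriteLine(missing.Value));
Console.WriteLine(report);
```

The printed text can be pasted into a reply as-is.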

Going to do that this morning - my client currently only has V22 in house.

This causes a stack overflow in V24:

using Aspose.Html;
using Aspose.Html.Converters;
using Aspose.Html.Saving;
using PdfTemp;

var httpClient = new HttpClient();
var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2022-12-05/AsOf/TOT5/C1TOC.html";
//var url = @"https://manuals.health.mil/pages/DisplayManualHtmlFile/2024-10-24/AsOf/TOT5/FOREWORD.html";
var cancellationToken = new CancellationToken();
var request = new HttpRequestMessage(HttpMethod.Get, url);

using var result = httpClient.Send(request, HttpCompletionOption.ResponseContentRead, cancellationToken);
result.EnsureSuccessStatusCode();

using var stream = result.Content.ReadAsStream(cancellationToken);
stream.Seek(0, SeekOrigin.Begin);

using var streamProvider = new MemoryStreamProvider();
using var doc = new HTMLDocument(stream, url);
Converter.ConvertHTML(doc, new PdfSaveOptions(), streamProvider);

var pdfStream = streamProvider.Streams.First();
using (var fileStream = File.Create("file.pdf"))
{
    pdfStream.Seek(0, SeekOrigin.Begin);
    pdfStream.CopyTo(fileStream);
}

@PhilBoyd

Can you please share the screenshot of the error and stack trace information as well? We will further proceed with this information accordingly.

So V24 fixed that issue, but the PDF formatting is horrendous. Adding a base href to the HTML does not fix the formatting, and passing the base href to the HtmlLoadOptions constructor causes a stack overflow.

url: "https://manuals.health.mil/pages/DisplayManualHtmlFile/2024-10-31/AsOf/TOT5/FOREWORD.html"

var baseHref = "https://manuals.health.mil";
var options = new HtmlLoadOptions(baseHref)
{
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth,
    PageInfo = new PageInfo { Width = PageSize.PageLetter.Width, Height = PageSize.PageLetter.Height, IsLandscape = false },
};
using var pdf = new Document(htmlStream, options);  // <--- stack overflow here

@PhilBoyd

Would you please confirm if you are using Aspose.PDF only for now to convert HTML to PDF? Also, can you please share the generated output PDF for our reference?

I tried to upload the zip file, but I'm getting an error.

Can you shoot me an email at pboyd9@humanamility.com? I can reply with the zip file.

@PhilBoyd

We have sent you a private message where you can share the link of the .zip file by uploading it to Google Drive or Dropbox.