Hello,
Currently we extract the visible (rendered) text from the attached HTML file, i.e. the text you see when you load the HTML into a browser, with the following code:
var htmlLoadOptions = new HtmlLoadOptions
{
    PageLayoutOption = HtmlPageLayoutOption.ScaleToPageWidth
};
using (var pdfDocument = new Document(HtmlFile, htmlLoadOptions))
{
    var textAbsorber = new TextAbsorber();
    pdfDocument.Pages.Accept(textAbsorber);
    return new StringBuilder(textAbsorber.Text);
}
With the attached file, an exception (see also the attached screenshot) is thrown after a delay, and it cannot be caught:
System.ObjectDisposedException
   at System.Threading.CancellationTokenSource.ThrowObjectDisposedException()
   at #=z9x_fmliPZS6WYVDrqbMcKbg=+#=zeZvdCM9$OZO$R26MQg==.#=zD3$O$CJn_lkvKVtqxg==(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.TimerQueueTimer.CallCallback()
   at System.Threading.TimerQueueTimer.Fire()
   at System.Threading.TimerQueue.FireNextTimers()
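
Judging from the stack trace, the exception seems to be raised inside a timer callback on a thread-pool thread rather than on our calling thread, which would explain why a try/catch around the code above never sees it. The following minimal sketch reproduces that pattern with plain .NET types only. It is an assumption about the mechanism, not Aspose's actual internal code, and the class name TimerExceptionDemo is hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical demo class; not part of Aspose.PDF.
static class TimerExceptionDemo
{
    static void Main()
    {
        // Logs exceptions that escape on any thread. This only observes them;
        // the process still terminates after the handler returns.
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
            Console.Error.WriteLine("Unhandled: " + e.ExceptionObject);

        Timer timer = null;
        try
        {
            var cts = new CancellationTokenSource();

            // Assumed stand-in for a library-internal timeout: the callback
            // fires 500 ms later on a thread-pool thread.
            timer = new Timer(_ => cts.Cancel(), null, 500, Timeout.Infinite);

            // Disposing the source before the timer fires makes Cancel() throw
            // ObjectDisposedException inside the callback.
            cts.Dispose();
        }
        catch (ObjectDisposedException ex)
        {
            // Never reached: the try block has already completed when the timer fires.
            Console.WriteLine("Caught: " + ex.Message);
        }

        Thread.Sleep(2000);   // give the timer time to fire
        GC.KeepAlive(timer);  // keep the timer from being collected early
    }
}
```

In this sketch the catch block is never entered, and the UnhandledException handler can log the error but cannot keep the process alive, which matches what we observe with the attached file.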
HtmlExtraction.zip (35,7 KB)
Screenshot 2025-05-12 163802.png (26,2 KB)
Kind regards,
Andy