Issue when loading an html file into Aspose.Pdf

Vorennor · November 5, 2014, 8:02am

We have an issue when loading an HTML file using Aspose.Pdf and trying to retrieve the text.

· When saving the aspose document as PDF directly, the embedded text is fine

· But when extracting the text using a TextAbsorber, the characters of the retrieved text seem to be shifted backwards by 29 positions (‘S’ becomes ‘6’, ‘u’ becomes ‘X’, etc.)

Source code samples:

Snippet for loading the HTML:

Document objDocument = new Document("document.html", new HtmlLoadOptions());

Snippet for text extraction:

TextAbsorber objTextAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));

objDocument.Pages.Accept(objTextAbsorber);

File.WriteAllText("document.txt", objTextAbsorber.Text, Encoding.Unicode);

Snippet for PDF creation:

objDocument.Save("document.pdf");

Problem highlight:

Example sentence in the original document:

· Sue lost ouer 65 pounds and learned

Sentence copy/pasted from generated PDF document:

· Sue lost ouer 65 pounds and learned

Corresponding sentence extracted via the TextAbsorber:

· 6XH ORVW RXHU (1)(2) SRXQGV DQG OHDUQHG

o (1) being character 0x19

o (2) being character 0x18
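
To make the 29-position shift easy to verify, here is a minimal sketch in plain .NET (no Aspose.Pdf calls) that adds the offset back to the extracted sentence. The names ShiftCheck and Unshift are hypothetical, and leaving spaces untouched is an assumption based only on the sample sentence above. It is meant as a diagnostic for the observation, not as a fix.

```csharp
using System;
using System.Text;

public static class ShiftCheck
{
    // Offset observed in this thread: 'S' (0x53) came back as '6' (0x36), i.e. 29 lower.
    private const int Offset = 29;

    // Adds the offset back to every character except the space, which appears
    // unshifted in the sample sentence (an assumption drawn from that one sample).
    public static string Unshift(string garbled)
    {
        var sb = new StringBuilder(garbled.Length);
        foreach (char c in garbled)
        {
            sb.Append(c == ' ' ? c : (char)(c + Offset));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // \u0019 and \u0018 stand for the characters marked (1) and (2) above.
        string extracted = "6XH ORVW RXHU \u0019\u0018 SRXQGV DQG OHDUQHG";
        Console.WriteLine(Unshift(extracted)); // prints "Sue lost ouer 65 pounds and learned"
    }
}
```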

Attached:

document.html: the original HTML file (the image resources are missing but this is not a concern – behavior is identical with/without)

document.pdf: the document saved as PDF

document.txt: the text extracted from the document via the TextAbsorber

Best regards,

tilal.ahmad · November 5, 2014, 10:41pm

Hi,

Thanks for your inquiry. Please complete the HTML to PDF conversion first and then extract the text. The following code snippet shows this approach. Hopefully it will help you accomplish the task.

MemoryStream htmltopdf = new MemoryStream();

Document objDocument = new Document(myDir + "document.html", new HtmlLoadOptions());

objDocument.Save(htmltopdf);

objDocument = new Document(htmltopdf);

TextAbsorber objTextAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));

objDocument.Pages.Accept(objTextAbsorber);

File.WriteAllText(myDir + "document.txt", objTextAbsorber.Text, Encoding.Unicode);

objDocument.Dispose();

Please feel free to contact us for any further assistance.

Best Regards,

Vorennor · November 6, 2014, 6:12am

Hi,

Thanks for the workaround. The retrieved text is now correct; however, the formatting is different.

In the generated text file, lines are broken (carriage returns added) according to layout, while in the source HTML file they are continuous. For example, let’s take the bottom-left caption:

Taken from the HTML file:
“Sue had been asked to sit in on some photos when she took her toddler for portraits in 1991.”

Corresponding lines taken from the extracted text (generated as Raw):

“Sue had been asked to
sit in on some photos
when she took her
toddler for portraits in
1991.”

Is there some way to work around this problem as well?

Thanks!

tilal.ahmad · November 7, 2014, 1:14am

Hi there,

Thanks for your feedback. Aspose.Pdf tries to mimic the HTML browser view in the PDF as closely as possible; please find the attached PDF document. TextAbsorber extracts the text according to the text layout in the PDF. Please elaborate on your requirement a bit more: do you want only the text from the HTML without formatting, or do you need the HTML to PDF conversion as well? We will then look into it and update you accordingly.
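
If only the continuous text is needed, one possible approach is to post-process the extracted text and join the line breaks that come from the page layout. The sketch below uses plain .NET string handling, not an Aspose.Pdf feature. The names Reflow and JoinLayoutLines are hypothetical, and treating a blank line as a paragraph boundary is an assumption about the extracted output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Reflow
{
    // Joins lines that were split by page layout into one continuous line.
    // A blank line is treated as a paragraph boundary and kept (an assumption).
    public static string JoinLayoutLines(string text)
    {
        // Normalize Windows and bare carriage-return line endings to '\n'.
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Split into paragraphs on blank lines.
        string[] paragraphs = Regex.Split(normalized, @"\n\s*\n");

        for (int i = 0; i < paragraphs.Length; i++)
        {
            // Collapse each remaining line break, with any surrounding
            // whitespace, into a single space.
            paragraphs[i] = Regex.Replace(paragraphs[i].Trim(), @"\s*\n\s*", " ");
        }

        return string.Join("\n\n", paragraphs);
    }

    public static void Main()
    {
        string extracted =
            "Sue had been asked to\r\nsit in on some photos\r\nwhen she took her\r\n" +
            "toddler for portraits in\r\n1991.";
        Console.WriteLine(JoinLayoutLines(extracted));
    }
}
```

This does not rejoin words hyphenated at a line end, and it cannot recover paragraph breaks that the extracted text does not mark with a blank line.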

Best Regards,