We have an issue when loading an HTML file with Aspose.Pdf and then trying to extract its text.
· When saving the Aspose document directly as PDF, the embedded text is fine.
· But when extracting the text with a TextAbsorber, every character of the retrieved text appears to be shifted backwards by 29 code points (‘S’ becomes ‘6’, ‘u’ becomes ‘X’, etc.).
Source code samples:
Snippet for loading the HTML:
Document objDocument = new Document("document.html", new HtmlLoadOptions());
Snippet for text extraction:
TextAbsorber objTextAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
objDocument.Pages.Accept(objTextAbsorber);
File.WriteAllText("document.txt", objTextAbsorber.Text, Encoding.Unicode);
Snippet for PDF creation:
objDocument.Save("document.pdf");
Problem highlight:
Example sentence in the original document:
· Sue lost ouer 65 pounds and learned
Sentence copy/pasted from generated PDF document:
· Sue lost ouer 65 pounds and learned
Corresponding sentence extracted via the TextAbsorber:
· 6XH ORVW RXHU (1)(2) SRXQGV DQG OHDUQHG
o (1) being character 0x19
o (2) being character 0x18
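To double-check the offset, here is a minimal sketch that adds 29 back to each extracted character. It does not use Aspose at all. The class name ShiftCheck and the method Unshift are made up for this illustration, and leaving spaces untouched is an assumption based only on the sample above, where spaces appear unshifted. It is meant as a diagnostic for this one sentence, not as a fix or a general workaround.

```csharp
using System;
using System.Linq;

// Hypothetical helper for this report only; not part of Aspose.Pdf.
public static class ShiftCheck
{
    // Offset observed above: 'S' (0x53) was extracted as '6' (0x36), i.e. 29 lower.
    public const int Offset = 29;

    // Adds the offset back to every character except the space (0x20),
    // which is assumed to be unshifted because it is in the sample sentence.
    public static string Unshift(string extracted)
    {
        return new string(
            extracted.Select(c => c == ' ' ? c : (char)(c + Offset)).ToArray());
    }

    public static void Main()
    {
        // (1) and (2) from the extracted sentence are written as \u0019 and \u0018.
        string extracted = "6XH ORVW RXHU \u0019\u0018 SRXQGV DQG OHDUQHG";
        Console.WriteLine(Unshift(extracted));
    }
}
```

Applied to the extracted sentence, this gives back the original wording, which suggests the shift is a constant 29 for letters and digits alike, at least in this sample.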
Attached:
document.html: the original HTML file (the image resources are missing but this is not a concern – behavior is identical with/without)
document.pdf: the document saved as PDF
document.txt: the text extracted from the document via the TextAbsorber
Best regards,