Strange text extracted using TextAbsorber

I am trying to extract the text from a PDF file using the TextAbsorber class; here is the code I am using:


using (Stream stream = File.OpenRead("test.pdf"))
using (Document pdfDocument = new Document(stream))
{
    // Raw mode: extract the text without applying formatting routines
    TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));

    // Visit every page, then write the collected text to disk
    pdfDocument.Pages.Accept(textAbsorber);
    File.WriteAllText("pdfContent.txt", textAbsorber.Text);
}

I would expect pdfContent.txt to contain “Pdf test” multiple times; see the attached pdfContent.txt file for the actual content. (Note: I am not using an Aspose license here, to keep things simple.)

Can you tell me what's wrong with the PDF file?

Hi Daniel,


Thanks for using our APIs.

I have tested the scenario and am able to reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-36742. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

Hello there,


Any progress so far? Is the issue tracker public? Can I see the status anywhere?

Greetings from Germany
Daniel

Hi Daniel,


Thanks for your patience.

We have investigated the issue further and, as per our observations, it is not possible to extract text from this document: the document’s font does not provide a mapping to Unicode, so the character codes stored in the page content cannot be translated back into readable text.
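
To illustrate what a missing Unicode mapping means in practice, here is a minimal sketch in plain C#. It does not use Aspose.Pdf and is not how the library works internally; the `Decode` helper, the glyph codes, and the mapping table are all hypothetical. It assumes a subset font whose glyph codes are simply numbered in order of first use, which is one common reason the codes carry no meaning of their own.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class ToUnicodeSketch
{
    // Hypothetical decoder: translates glyph codes to text through an
    // optional code-to-Unicode table (the role a font's Unicode mapping plays).
    static string Decode(byte[] glyphCodes, IReadOnlyDictionary<byte, char> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (byte code in glyphCodes)
        {
            if (toUnicode != null && toUnicode.TryGetValue(code, out char ch))
                sb.Append(ch);          // mapping available: readable character
            else
                sb.Append((char)code);  // no mapping: only the raw code is known
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Assumed glyph codes for "Pdf test" in a made-up subset font.
        byte[] codes = { 1, 2, 3, 4, 5, 6, 7, 5 };

        var map = new Dictionary<byte, char>
        {
            { 1, 'P' }, { 2, 'd' }, { 3, 'f' }, { 4, ' ' },
            { 5, 't' }, { 6, 'e' }, { 7, 's' }
        };

        // With the table, the codes decode to the expected text.
        Console.WriteLine(Decode(codes, map)); // prints "Pdf test"

        // Without it, the same codes become control characters, i.e. "strange text".
        string garbled = Decode(codes, null);
        Console.WriteLine(string.Join(" ", Array.ConvertAll(garbled.ToCharArray(), c => ((int)c).ToString("X2"))));
    }
}
```

The page renders correctly either way because a viewer only needs the glyph shapes. A text extractor, by contrast, needs the mapping to recover the characters.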

The issues you have found earlier (filed as PDFNEWNET-36742) have been fixed in Aspose.Pdf for .NET 9.2.0.

The blog post for this release is created over this link

