Strange text extracted using TextAbsorber

dhuhn · April 10, 2014, 8:59am

I am trying to extract the text from a PDF file using the TextAbsorber class; here is the code I am using:

using (Stream stream = File.OpenRead("test.pdf"))
using (Document pdfDocument = new Document(stream))
{
    // Extract the text of all pages in raw formatting mode.
    TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    pdfDocument.Pages.Accept(textAbsorber);

    File.WriteAllText("pdfContent.txt", textAbsorber.Text);
}

I would expect pdfContent.txt to contain “Pdf test” multiple times; see the attached pdfContent.txt file for the actual content. (Note: I am not using an Aspose license here, to keep things simple.)

Can you tell me what is wrong with the PDF file?

codewarior · April 10, 2014, 11:59pm

Hi Daniel,

Thanks for using our APIs.

I have tested the scenario and can reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-36742 so that it can be corrected. We will investigate the issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

dhuhn · April 28, 2014, 8:31am

Hello there,

Any progress so far? Is the issue tracker public? Can I see the status anywhere?

Greetings from Germany

Daniel

codewarior · April 28, 2014, 12:04pm

Hi Daniel,

Thanks for your patience.

We have investigated the issue further and, as per our observations, it is not possible to extract text from this document: the document’s font does not provide a mapping to Unicode.
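To illustrate why a missing Unicode mapping blocks extraction, here is a minimal sketch in plain C#. It does not use Aspose.Pdf and is not how the library works internally. The class `GlyphDecoder`, its `Decode` method, and the byte-to-character table are hypothetical names chosen for this example. It assumes the page content stores font-specific character codes, and that readable text can only be recovered when the font supplies a table from those codes to Unicode.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of Aspose.Pdf.
public static class GlyphDecoder
{
    // Translates font-specific character codes into text using an optional
    // code-to-Unicode table. A code without a mapping becomes U+FFFD,
    // because the code alone says nothing about the character it draws.
    public static string Decode(byte[] codes, IReadOnlyDictionary<byte, char> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (byte code in codes)
        {
            if (toUnicode != null && toUnicode.TryGetValue(code, out char ch))
            {
                sb.Append(ch);
            }
            else
            {
                sb.Append('\uFFFD');
            }
        }
        return sb.ToString();
    }
}
```

With a table that maps 1, 2, 3 to 'P', 'd', 'f', the codes { 1, 2, 3 } decode to "Pdf". With no table, the same codes yield only replacement characters. A real extractor may guess differently when the mapping is missing, which would explain unexpected output rather than an empty file.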

aspose.notifier · May 2, 2014, 7:01am

The issue you found earlier (filed as PDFNEWNET-36742) has been fixed in Aspose.Pdf for .NET 9.2.0.

The blog post for this release is available at this link

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.