PDF to text yields words in random order

Charybdis · April 4, 2018, 1:30pm

I am currently using Aspose.PDF for .NET version 17.11. I am trying to convert PDF files to text using the TextAbsorber class. This works for most files, but some yield text that looks shuffled and has odd spacing. Are there any specific options I should use?
The conversion code is:

var pdf = new Aspose.Pdf.Document(path);
var absorber = new TextAbsorber();
pdf.Pages.Accept(absorber);
var text = absorber.Text;

The problem can be recreated with this file: BIB-DE000DGQ4425.pdf (71.4 KB)
The conversion of the attached file results in this: extraction_result.zip (6.5 KB)

What can I do to solve this problem?

Farhan.Raza · April 4, 2018, 6:47pm

@Charybdis

Thank you for contacting support.

Please try the code snippet below with Aspose.PDF for .NET 18.3 in your environment. We have attached the generated file for your reference: BIB-DE000DGQ4425_18.3.zip

        // Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
        // Open document
        Document pdfDocument = new Document(dataDir + "BIB-DE000DGQ4425.pdf");

        // Create a TextAbsorber that uses the MemorySaving formatting mode
        // (instead of the default constructor: new TextAbsorber())
        TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
        // Accept the absorber for all the pages
        pdfDocument.Pages.Accept(textAbsorber);
        // Get the extracted text
        string extractedText = textAbsorber.Text;
        // Write the extracted text to a UTF-8 file; the using block closes the stream
        using (TextWriter tw = new StreamWriter(dataDir + "BIB-DE000DGQ4425_18.3.txt", false, System.Text.Encoding.UTF8))
        {
            tw.WriteLine(extractedText);
        }

If you notice any problem, please share a screenshot highlighting the issue so that we can reproduce and investigate it in our environment.

Charybdis · April 5, 2018, 9:44am

Setting the TextExtractionOptions as you described solves the problem, even with the older version.
However, I lose the formatting of the input, which is a big deal when trying to parse tables. Is there any other way?

Farhan.Raza · April 5, 2018, 6:36pm

@Charybdis

Thank you for your kind feedback.

We are glad to hear that the problem has been resolved. The suggested approach does not extract the text in exactly the same format as it appears in the PDF file. It ensures that all text content is extracted from the PDF file, so we are afraid that the formatting of the input may not be maintained.