Issue while extracting text from a multi-column PDF

saurabh.arora · May 26, 2020, 3:41pm

Hi,

I am trying to extract text from a multi-column PDF, but the output is incorrect: lines that do not belong to the same paragraph are appended to each other. I want lines from the same paragraph, in the same column, to be appended together. I am using the following code:

// Load the PDF and extract its text using the "Pure" formatting mode.
Document document = new Document(inputStream);
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
textAbsorber.getExtractionOptions().setScaleFactor(0.5);
// Run the absorber over every page and collect the extracted text.
document.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();

The source document and the output I get are attached:
pdfextract.zip (43.1 KB)

Adnan.Ahmad · May 27, 2020, 11:08am

@saurabh.arora,

Thanks for contacting support.

I have tested the source file you shared and was unable to reproduce the issue. I have attached my generated output for your reference. Also, you are using an old version of Aspose.PDF. Could you please try the latest version of Aspose.PDF on your end?

Extracted_text.zip (2.1 KB)

saurabh.arora · May 27, 2020, 11:38am

Thanks for the reply.

The output you shared is the same one I am getting. However, since this is a multi-column document, the reading order of the extracted text is not correct.

I want the output to look like this:

Lorem ps’arn teter sr. asm. ccesecaetger adipiatisi eli. sed diam oaosmsiy ssbl essstod eotidam v.
laoreet telare maps abqaam eta: volspat Ut vai earn ad scorn vesiast qais oastrad exerci abas s-astt
orcer laKt^r. labarts aid v. alaqMp - x «a tzmstoda cosseqas Ess astern vel earn ’ante teter is
…
…

That is, the lines of a paragraph within the same column should be appended together. I hope this makes my requirement clear.
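
To make the requested reading order concrete, here is a minimal plain-Java sketch. It does not use the Aspose.PDF API. The `Line` type, the `readByColumn` method, and the single `splitX` column boundary are all hypothetical names introduced only for illustration, and it assumes each line's position is already known, with `y` measured as the distance from the top of the page. It shows only the ordering I am after: every line of the left column from top to bottom, then every line of the right column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnReadingOrder {

    /** Hypothetical holder for one extracted line and its position on the page. */
    static final class Line {
        final double x;     // left edge of the line
        final double y;     // distance from the top of the page (assumption)
        final String text;

        Line(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Joins lines column by column for a two-column page: lines whose left
     * edge is before splitX come first (top to bottom), then the remaining lines.
     */
    static String readByColumn(List<Line> lines, double splitX) {
        List<Line> left = new ArrayList<>();
        List<Line> right = new ArrayList<>();
        for (Line line : lines) {
            if (line.x < splitX) {
                left.add(line);
            } else {
                right.add(line);
            }
        }

        // Within each column, order the lines from top to bottom.
        Comparator<Line> topToBottom = Comparator.comparingDouble(l -> l.y);
        left.sort(topToBottom);
        right.sort(topToBottom);

        StringBuilder out = new StringBuilder();
        for (Line line : left) {
            out.append(line.text).append('\n');
        }
        for (Line line : right) {
            out.append(line.text).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Lines listed row by row, which is the order I currently get.
        List<Line> lines = List.of(
                new Line(50, 100, "left 1"),
                new Line(320, 100, "right 1"),
                new Line(50, 115, "left 2"),
                new Line(320, 115, "right 2"));
        System.out.print(readByColumn(lines, 300));
    }
}
```

With the sample input in `main`, both left-column lines come out before the right-column lines, instead of the two columns being interleaved row by row.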

Adnan.Ahmad · May 28, 2020, 12:55pm

@saurabh.arora,

Thanks for contacting support.

I have looked into your issue and would like to inform you that I have created an investigation ticket with ID PDFJAVA-48166 in our issue tracking system to investigate and resolve this issue as soon as possible.