Free Support Forum - aspose.com

Issue while extracting text from multi columnar pdf

Hi,

I am trying to extract text from a multi-column PDF, but the output is incorrect: lines that do not belong to the same paragraph are appended together. I want lines from the same paragraph, in the same column, to be appended together. I am using the following code:

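import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextExtractionOptions;

// Load the PDF, extract text from all pages in Pure formatting mode, then read the result.
Document document = new Document(inputStream);
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
textAbsorber.getExtractionOptions().setScaleFactor(0.5);
document.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();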

The source document and the output I get are attached:
pdfextract.zip (43.1 KB)

@saurabh.arora,

Thanks for contacting support.

I have tested the source file you shared and am unable to reproduce the issue. My generated output is attached for your reference. Also, you are using an old version of Aspose.PDF; please try the latest version on your end.

Extracted_text.zip (2.1 KB)

Thanks for the reply.

The output you shared is the same one I am getting. But since this is a multi-column document, the reading order of the extracted text is not correct.

I want the output to look like this:

Lorem ps’arn teter sr. asm. ccesecaetger adipiatisi eli. sed diam oaosmsiy ssbl essstod eotidam v.
laoreet telare maps abqaam eta: volspat Ut vai earn ad scorn vesiast qais oastrad exerci abas s-astt
orcer laKt^r. labarts aid v. alaqMp - x «a tzmstoda cosseqas Ess astern vel earn ’ante teter is

That is, the lines of a paragraph within the same column should be appended together. I hope this makes my requirement clear.
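To make the requested reading order concrete, here is a rough sketch in plain Java that does not use Aspose.PDF at all. It assumes each extracted line is already available with an x/y position (y increasing down the page) and that the column boundaries are known. The names ColumnReflow, Line, reflow, boundaries and paragraphGap are hypothetical and only for illustration; they are not part of any library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnReflow {

    /** One extracted line with its position. Assumption: y grows downward. */
    static final class Line {
        final double x;
        final double y;
        final String text;

        Line(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Assigns each line to a column using the given x boundaries (sorted ascending),
     * reads each column top to bottom, and joins consecutive lines into one
     * paragraph until the vertical gap between lines exceeds paragraphGap.
     */
    static List<String> reflow(List<Line> lines, double[] boundaries, double paragraphGap) {
        // One bucket per column: boundaries.length + 1 columns in total.
        List<List<Line>> columns = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            columns.add(new ArrayList<>());
        }
        for (Line line : lines) {
            int col = 0;
            while (col < boundaries.length && line.x >= boundaries[col]) {
                col++;
            }
            columns.get(col).add(line);
        }

        List<String> paragraphs = new ArrayList<>();
        for (List<Line> column : columns) {
            // Read the column from top to bottom.
            column.sort(Comparator.comparingDouble((Line l) -> l.y));
            StringBuilder current = new StringBuilder();
            double previousY = 0;
            for (Line line : column) {
                // A large vertical gap closes the current paragraph.
                if (current.length() > 0 && line.y - previousY > paragraphGap) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line.text.trim());
                previousY = line.y;
            }
            if (current.length() > 0) {
                paragraphs.add(current.toString());
            }
        }
        return paragraphs;
    }
}
```

With a two-column page split at x = 200, calling reflow(lines, new double[] {200}, 15) would return all paragraphs of the left column first, then those of the right column, with each paragraph's lines joined by a single space. The gap threshold is a simple heuristic and would need tuning for a real document.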

@saurabh.arora,

Thanks for contacting support.

I have looked into your issue and have created an investigation ticket with ID PDFJAVA-48166 in our issue tracking system to investigate and resolve it as soon as possible.