Free Support Forum - aspose.com

Issue in extracting text from multi columnar pdf

Hi,

I am trying to extract text from a multi-column PDF document, but the output does not follow the document's structure. I have used the following code:

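import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextExtractionOptions;

// Load the PDF and extract text from all pages in "Pure" formatting mode
Document document = new Document(inputStream);
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
textAbsorber.getExtractionOptions().setScaleFactor(0.5);
document.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();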

The output is produced line by line across the whole page, so words from different columns are concatenated on the same line.

I want the text of each column kept together, with the lines of a paragraph in the same column concatenated.

I am attaching my input document and output text for your reference. I am using version 18.7.

pdfextract.zip (43.1 KB)

Hi Team,

Is there any update?
Please help, as this is blocking us in production.

@saurabh.arora,

Thanks for contacting support.

I have worked with the source file you shared and was unable to observe the issue. I have shared my generated result for your reference. Also, you are using an old version of Aspose.PDF. Can you please try the latest version of Aspose.PDF on your end?

Extracted_text.zip (2.1 KB)

Thanks for the reply.

The output you shared is the same one I am getting. But since this is a multi-column document, the reading order of the extracted text is not correct.

I want the output like this:

Lorem ps’arn teter sr. asm. ccesecaetger adipiatisi eli. sed diam oaosmsiy ssbl essstod eotidam v.
laoreet telare maps abqaam eta: volspat Ut vai earn ad scorn vesiast qais oastrad exerci abas s-astt
orcer laKt^r. labarts aid v. alaqMp - x «a tzmstoda cosseqas Ess astern vel earn ’ante teter is

That is, the lines of a paragraph in the same column should be appended together. I hope this makes my requirement clear.
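
To illustrate the reading order I am asking for, here is a minimal sketch in plain Java. It does not use any Aspose API: Fragment and readByColumn are hypothetical names, and it assumes a two-column page where the x position of the gap between the columns is already known and each extracted line comes with its left x and baseline y coordinate. It only shows the intended ordering (all of the left column top to bottom, then all of the right column), not how the library should implement it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnOrderSketch {

    // Hypothetical stand-in for one extracted line of text and its position.
    // x is the left edge of the line; y is its baseline (PDF y grows upward).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Assumes exactly two columns separated at columnBoundaryX.
    // Returns the left column's lines top to bottom, then the right column's.
    static String readByColumn(List<Fragment> fragments, double columnBoundaryX) {
        List<Fragment> left = new ArrayList<>();
        List<Fragment> right = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f.x < columnBoundaryX) {
                left.add(f);
            } else {
                right.add(f);
            }
        }

        // Larger y is higher on the page, so sort by descending y.
        Comparator<Fragment> topToBottom = (a, b) -> Double.compare(b.y, a.y);
        left.sort(topToBottom);
        right.sort(topToBottom);

        StringBuilder sb = new StringBuilder();
        for (Fragment f : left) {
            sb.append(f.text).append('\n');
        }
        for (Fragment f : right) {
            sb.append(f.text).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments in the row-by-row order that the current extraction produces.
        List<Fragment> rowOrder = new ArrayList<>();
        rowOrder.add(new Fragment(50, 700, "left line 1"));
        rowOrder.add(new Fragment(320, 700, "right line 1"));
        rowOrder.add(new Fragment(50, 680, "left line 2"));
        rowOrder.add(new Fragment(320, 680, "right line 2"));

        System.out.print(readByColumn(rowOrder, 300.0));
    }
}
```

The coordinates and the boundary value 300.0 in main are made up for the example; for a real page they would have to come from the actual layout of the document.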

@saurabh.arora,

Thanks for contacting support.

I have observed your issue and would like to inform you that I have created an investigation ticket with ID PDFJAVA-48166 in our issue tracking system to investigate and resolve this issue as soon as possible.