Issue extracting text from a multi-column PDF

saurabh.arora · May 25, 2020, 4:25pm

Hi,

I am trying to extract text from a multi-column PDF document, but the output does not follow the document's structure. I used the following code:

import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextExtractionOptions;

Document document = new Document(inputStream);
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
textAbsorber.getExtractionOptions().setScaleFactor(0.5);
document.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();

The output is produced line by line across the full page width, so words from different columns are concatenated on the same line.

I want the lines of each paragraph to be joined within their own column.

I am attaching my input document and the output text for your reference. I am using version 18.7.

pdfextract.zip (43.1 KB)

saurabh.arora · May 26, 2020, 3:42pm

Hi Team,

Is there any update?
Please help, as this is blocking us in production.

Adnan.Ahmad · May 27, 2020, 11:07am

@saurabh.arora,

Thanks for contacting support.

I have tested the source file you shared and was unable to reproduce the issue. I have attached my generated result for your reference. Also, you are using an old version of Aspose.PDF. Could you please try the latest version on your end?

Extracted_text.zip (2.1 KB)

saurabh.arora · May 27, 2020, 11:37am

Thanks for the reply.

The output you shared is the same one I am getting. But since this is a multi-column document, the reading order of the extracted text is not correct.

I want the output like this :

Lorem ps’arn teter sr. asm. ccesecaetger adipiatisi eli. sed diam oaosmsiy ssbl essstod eotidam v.
laoreet telare maps abqaam eta: volspat Ut vai earn ad scorn vesiast qais oastrad exerci abas s-astt
orcer laKt^r. labarts aid v. alaqMp - x «a tzmstoda cosseqas Ess astern vel earn ’ante teter is
…
…

That is, the lines of a paragraph in the same column should be appended together. I hope this makes my requirement clear.
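The reading order described above can be sketched independently of any PDF library. This is only a minimal illustration, assuming each extracted line is available together with the coordinates of its left edge (how to obtain those positions from Aspose.PDF is not shown here). The names `Fragment`, `readByColumns`, and `columnGap` are hypothetical and not part of the Aspose.PDF API. Lines are grouped into columns by their left edge, and each column is then read top to bottom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnReadingOrder {

    /** One extracted line with the position of its left edge (hypothetical type). */
    static final class Fragment {
        final double x;    // left edge of the line
        final double y;    // baseline in PDF-style coordinates: larger y is higher on the page
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups lines into columns by left edge, then joins each column top to bottom.
     * A new column starts when the left edge jumps by more than columnGap
     * (an assumed, document-specific threshold).
     */
    static String readByColumns(List<Fragment> fragments, double columnGap) {
        List<Fragment> byX = new ArrayList<>(fragments);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        List<List<Fragment>> columns = new ArrayList<>();
        double previousX = 0;
        for (Fragment f : byX) {
            if (columns.isEmpty() || f.x - previousX > columnGap) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(f);
            previousX = f.x;
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> column : columns) {
            // Highest line first, since y grows upward in this coordinate system.
            column.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());
            if (out.length() > 0) {
                out.append("\n\n"); // blank line between columns
            }
            for (int i = 0; i < column.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(column.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Lines listed row by row, the way the page-wide extraction interleaves them.
        List<Fragment> lines = Arrays.asList(
                new Fragment(50, 700, "left line 1"),
                new Fragment(320, 700, "right line 1"),
                new Fragment(50, 680, "left line 2"),
                new Fragment(320, 680, "right line 2"));
        System.out.println(readByColumns(lines, 100));
    }
}
```

The sketch assumes every line in a column starts at roughly the same left edge. Indented lines, headings that span columns, or uneven gutters would need a more careful grouping rule.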

Adnan.Ahmad · May 28, 2020, 12:55pm

@saurabh.arora,

Thanks for contacting support.

I have observed the issue and have created an investigation ticket with ID PDFJAVA-48166 in our issue tracking system so that it can be investigated and resolved as soon as possible.

maochen · November 5, 2024, 5:52am

How was this problem resolved in the end?

asad.ali · November 5, 2024, 3:56pm

@maochen

We are afraid that the earlier logged ticket has not been resolved yet because of other pending issues in the queue. Your concerns have been recorded, and we will notify you as soon as the issue is resolved. We apologize for the inconvenience.