Text Extraction (Names) split in different rows

cameyk · March 22, 2019, 11:18am

Hi,

We are trying to extract text from PDF files using the paid version of Aspose.PDF for Java. In some files the text is grouped into blocks, so a name that belongs together is not present on a single line.

For example:

Merrill Lynch BMO Nesbitt CIBC World
Canada Inc. Burns Inc. Markets Inc.

In the above example, Aspose is unable to read those lines as names, although other readers (desktop software) move the cursor to “Canada” after “Lynch” and to “Burns” after “Nesbitt”.

Here we want “Merrill Lynch Canada Inc.” as a single line, and likewise for all the others.

It would be a great help if you could provide Java code to extract the data this way.

Thanks & regards,
Amey

asad.ali · March 22, 2019, 5:03pm

@cameyk

Thanks for contacting support.

Would you please share your sample PDF document with us? We will test the scenario in our environment and address it accordingly.

cameyk · March 26, 2019, 8:29am

Hi,

My concern is that Aspose.PDF for Java should read text in the order the cursor moves through it (as in other PDF reader software), not in the order it is laid out visually in the PDF.

Please try creating a PDF from the URL given below, and provide Java code that extracts the correct data or sentences.

https://www.sec.gov/Archives/edgar/data/225090/000119312513284910/d565048dsuppl.htm

Thanks & regards,
Amey

asad.ali · March 26, 2019, 5:10pm

@cameyk

Thanks for sharing more details.

We have tested the scenario in our environment using the following code snippets and observed that the API extracts the text differently from what you require. However, would you please share the sample code snippet you are using on your side? It would help us understand the complete scenario and address it accordingly.

First Code Snippet

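import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;

// Extract all text from the document in one pass
Document doc = new Document(dataDir + "WebpageToPdf.pdf");
TextAbsorber ta = new TextAbsorber();
ta.visit(doc);
System.out.println(ta.getText());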

Second Code Snippet

Document doc = new Document(dataDir + "WebpageToPdf.pdf");
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
// Walk pages -> sections -> paragraphs -> lines -> text fragments
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            for (java.util.List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                // End of one line within the paragraph
                sb.append("\r\n");
            }
            // Blank line between paragraphs
            sb.append("\r\n");
            System.out.println(sb);
        }
    }
}

cameyk · March 27, 2019, 5:00am

Hi,

I had already tried the same code snippets before raising this issue, and they gave me wrong sentences.
The first code snippet works for normal paragraphs but fails to read text that is laid out in separate blocks of words. In the above sample, “CIBC World Markets Inc.” is not read correctly.

The second code snippet works in some cases, but whenever there is only a small space between two separate word blocks it gives the wrong result. In the above example, if the space between “National Bank” and “Scotia Capital” is not large enough, it fails to read “National Bank Financial Inc.” and instead reads “National Bank Scotia Capital Inc. Financial Inc.”, even though the cursor moves correctly.

In brief, if you read the document in a PDF reader, the cursor movement is always correct. So I need a code sample that reads the text in the order the cursor moves through the document. Both code snippets above fail to do that. If there is another way to read it, please help us with that.

Thanks & regards,
Amey Kadam

asad.ali · March 27, 2019, 3:09pm

@cameyk

We are further testing the scenario and will share our feedback with you shortly.

asad.ali · March 27, 2019, 7:50pm

@cameyk

We regret that we cannot offer a code sample for your requirements, as this feature may need further investigation. However, please check the following screenshot of the console output in our environment, where the text appears to be extracted correctly.
ExtracedText.png (5.0 KB)

Furthermore, could you please share screenshots of the issues you mentioned above? It would also be helpful if you could share the PDF document you generated on your side from the URL you shared. We will then proceed to assist you accordingly.
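
For readers with the same requirement, the column-wise reading order described in this thread can be approximated outside the library by grouping words on their coordinates. The following is only a minimal sketch under stated assumptions, not an Aspose API and not a solution confirmed by support: `ColumnGrouper`, `Word`, and `maxGap` are hypothetical names, and the sketch assumes each word's left edge, right edge, and line position have already been obtained from whatever extractor is in use. It splits each line into blocks at large horizontal gaps, then stacks blocks that overlap horizontally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColumnGrouper {

    /** Hypothetical input type: a word, its horizontal extent, and its line's top-down y. */
    public static final class Word {
        final String text;
        final double left, right, y;
        public Word(String text, double left, double right, double y) {
            this.text = text; this.left = left; this.right = right; this.y = y;
        }
    }

    /** A run of text with a horizontal extent; used for both line blocks and columns. */
    private static final class Span {
        final StringBuilder text = new StringBuilder();
        double left, right;
    }

    /** Returns one string per column, columns ordered left to right. */
    public static List<String> group(List<Word> words, double maxGap) {
        // 1. Bucket words by line. Simplifying assumption: words on the same
        //    line share exactly the same y. TreeMap orders lines top to bottom.
        Map<Double, List<Word>> lines = new TreeMap<>();
        for (Word w : words) {
            lines.computeIfAbsent(w.y, k -> new ArrayList<>()).add(w);
        }

        List<Span> columns = new ArrayList<>();
        for (List<Word> line : lines.values()) {
            line.sort(Comparator.comparingDouble(w -> w.left));

            // 2. Split the line into blocks wherever the gap exceeds maxGap.
            List<Span> blocks = new ArrayList<>();
            Span current = null;
            for (Word w : line) {
                if (current == null || w.left - current.right > maxGap) {
                    current = new Span();
                    current.left = w.left;
                    blocks.add(current);
                } else {
                    current.text.append(' ');
                }
                current.text.append(w.text);
                current.right = w.right;
            }

            // 3. Attach each block to the first column it overlaps horizontally,
            //    or start a new column if it overlaps none.
            for (Span b : blocks) {
                Span target = null;
                for (Span c : columns) {
                    if (b.left <= c.right && b.right >= c.left) { target = c; break; }
                }
                if (target == null) {
                    target = new Span();
                    target.left = b.left;
                    target.right = b.right;
                    columns.add(target);
                } else {
                    target.text.append(' ');
                    target.left = Math.min(target.left, b.left);
                    target.right = Math.max(target.right, b.right);
                }
                target.text.append(b.text);
            }
        }

        // 4. Emit the columns left to right.
        columns.sort(Comparator.comparingDouble(c -> c.left));
        List<String> result = new ArrayList<>();
        for (Span c : columns) {
            result.add(c.text.toString());
        }
        return result;
    }
}
```

A fixed gap threshold has the same weakness reported above for “National Bank” and “Scotia Capital”: when the space between two blocks is about as small as a normal word space, no single `maxGap` value separates them, so the threshold would need tuning per document or a different signal altogether.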