How can I extract all the text with its x-y coordinates and all related information using Aspose.PDF?

paulpre · December 19, 2021, 7:30pm

Ok here it is
page1.pdf (24.9 KB)
And here is the file with text + coordinates enumerated with the (very) old Adobe lib:
page1.zip (4.9 KB)

Thank you very much for taking a look.

I tried with ParagraphAbsorber > MarkupSection > MarkupParagraph > TextFragment > TextSegment, but got the same result: not all of the text in the table is listed.
Perhaps I should try enumerating the table, but it should be simple to get all the texts and text states without having to know more.

asad.ali · December 19, 2021, 8:28pm

@paulpre

We are checking it and will get back to you shortly.

paulpre · December 19, 2021, 8:34pm

Ok thank you very much.

For information, I tested another way, enumerating the table content, but got the same result.
The code:
com.aspose.pdf.TableAbsorber absorber = new com.aspose.pdf.TableAbsorber();
absorber.visit( page );
for (com.aspose.pdf.AbsorbedTable table : absorber.getTableList()) {
    System.out.println( "Table" );
    for (com.aspose.pdf.AbsorbedRow row : table.getRowList()) {
        for (com.aspose.pdf.AbsorbedCell cell : row.getCellList()) {
            for (com.aspose.pdf.TextFragment fragment : cell.getTextFragments()) {
                for (com.aspose.pdf.TextSegment tseg : fragment.getSegments()) {
                    System.out.println( String.format( "- '%s'", tseg.getText() ) );
                }
            }
        }
    }
}

And the result:
Table
- 'Account'
- 'Prior Period Current Month Actual'
- 'Number Description'
- 'Balance Actuals YearTo Date'

Finally, a test with TextDevice.
Only this one, with the “Pure” parameter, seems to recognize that the yellow header contains all the different areas.
I deduce this because it places them at the right locations in the resulting text file.
You can see this in the screenshot:
image.png (3.5 KB)

asad.ali · December 19, 2021, 9:32pm

@paulpre

Thanks for sharing more information. We will test the scenario from this perspective as well and will get back to you soon.

asad.ali · December 20, 2021, 4:51pm

@paulpre

Could you please also share the code snippet where you used the TextDevice, along with the coordinate values of the extracted text? We need to investigate this case further, and we need this information in order to log an investigation ticket.

By reconstructing the PDF, do you mean that you are generating a new PDF document from scratch using Aspose.PDF? If possible, please share that code snippet as well. We will further proceed accordingly.

paulpre · December 21, 2021, 11:07am

Hello

The code where I use TextDevice just produces a text file, but I did not find how to get the associated coordinates of the text. Is there a way to obtain the text coordinates with this approach? I do not have the sources of your classes…
I posted the screenshot of the produced text file above; here it is again: image.png

Here is the code:
try (java.io.OutputStream text_stream = new java.io.FileOutputStream( _LogDir+"/ExtractedText_Pure.txt", false )) {
    for (Page page : _PDFDocument.getPages()) {
        TextDevice textDevice = new TextDevice();
        TextExtractionOptions textExtOptions = new TextExtractionOptions( TextExtractionOptions.TextFormattingMode.Pure );
        textDevice.setExtractionOptions( textExtOptions );
        textDevice.process( page, text_stream );
    }
}

By reconstructing the PDF, yes, I mean that we generate a new PDF document from scratch; however, we do this not with Aspose.PDF but with the very old Adobe lib (which we would like to replace with Aspose). We bought your global license: we are the company Everteam and have a normal paid Aspose account. Sorry for using my personal account here.

The processing done by the application (a set of different programs) is:
1- open a “spool” PDF file (it contains printed documents sent to users, like invoices, …)
2- enumerate all information: text, images, annotations, fonts, …
3- build a text file (like the one I sent above: page1.zip) which contains the info, the text, and the coordinates, with lines like “TO2:408.05:638.25:463.45:638.25:Current Month”
4- save this text file
5- index the text in a database
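
A minimal sketch of how one line of the step 3 text file could be produced. It assumes the four numbers in the sample line are two x:y coordinate pairs written with two decimals; the thread does not define them, nor the meaning of the "TO2" tag, and the class and method names are hypothetical. The locale is pinned so the decimal separator is always a dot:

```java
import java.util.Locale;

// Hypothetical helper: formats one text record as tag:x1:y1:x2:y2:text.
// Assumption: the four numbers in the sample line are two x:y pairs.
final class TextRecordWriter {
    static String formatLine(String tag, double x1, double y1,
                             double x2, double y2, String text) {
        // Locale.ROOT keeps "." as the decimal separator on every machine.
        return String.format(Locale.ROOT, "%s:%.2f:%.2f:%.2f:%.2f:%s",
                tag, x1, y1, x2, y2, text);
    }

    public static void main(String[] args) {
        // Prints: TO2:408.05:638.25:463.45:638.25:Current Month
        System.out.println(
                formatLine("TO2", 408.05, 638.25, 463.45, 638.25, "Current Month"));
    }
}
```

Putting the text last means it can contain any character, including ':', without making the line ambiguous.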

Then, when a user searches the indexed text, we construct a document like the original. Our server has to:
a- create a new empty PDF document
b- add the right background image on each page of the document to rebuild
c- add images + text using the font + coordinates recorded in the saved text file
d- return the document to the user in the HTTP response stream
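
For step c-, the saved lines have to be read back. A minimal sketch of such a reader, assuming every line has the same six-field shape as the sample “TO2:408.05:638.25:463.45:638.25:Current Month”. The class names are hypothetical, and the four numbers are kept in file order because the thread does not say what each one means:

```java
// Hypothetical reader for lines shaped like tag:n1:n2:n3:n4:text.
final class TextRecordReader {

    // Parsed fields; the meaning of the four numbers is not stated in the thread.
    static final class TextRecord {
        final String tag;
        final double[] coords; // the four numbers, in file order
        final String text;

        TextRecord(String tag, double[] coords, String text) {
            this.tag = tag;
            this.coords = coords;
            this.text = text;
        }
    }

    static TextRecord parseLine(String line) {
        // Limit of 6 so that any ':' inside the text itself is preserved.
        String[] parts = line.split(":", 6);
        if (parts.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + line);
        }
        double[] coords = new double[4];
        for (int i = 0; i < 4; i++) {
            // Throws NumberFormatException if a field is not a number.
            coords[i] = Double.parseDouble(parts[i + 1]);
        }
        return new TextRecord(parts[0], coords, parts[5]);
    }
}
```

For example, parseLine applied to the sample line yields the tag "TO2", the four numbers 408.05, 638.25, 463.45, 638.25, and the text "Current Month".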

The problem is that the Adobe lib we use for steps 2- and 3- is too old (more than 20 years) and does not always fully recognize the data in recent PDF files.

Originally, all the programs were written in C++ with Visual C++ 6. Using our paid account, I asked whether your C++ SDK is usable with Visual C++ 6, but the answer was no…

So the solution I found is to modify our C++ OCX, which is responsible for steps 2- and 3-, to make it call a Java class method using JNI.
I have done that, so now the only problem left to solve is for the Java class using Aspose to obtain the coordinates of all the text parts like the old Adobe lib does, as shown in the text file I uploaded one or two days ago.

So we only need Aspose for steps 2- and 3-: the less we touch the C++ programs, the better.
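
The Java class called from the OCX is not shown in this thread. As a rough sketch of what such an entry point might look like, the hypothetical class below exposes one static method that takes and returns only strings, which keeps the native side simple. The RecordSource interface is an invented placeholder for whatever code actually reads the PDF:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical Java-side entry point for a native caller; all names are illustrative.
final class ExtractionBridge {

    // Hypothetical seam where the PDF library would be plugged in.
    interface RecordSource {
        List<String> recordsFor(String pdfPath) throws Exception;
    }

    // Placeholder that returns no records; a real source would read the PDF.
    private static volatile RecordSource source = path -> Collections.emptyList();

    static void setSource(RecordSource newSource) {
        source = newSource;
    }

    // Static, with only String types in the signature, so the JNI
    // method descriptor is (Ljava/lang/String;)[Ljava/lang/String;
    public static String[] extractRecords(String pdfPath) {
        try {
            return source.recordsFor(pdfPath).toArray(new String[0]);
        } catch (Exception e) {
            // Report failures as data so the native caller needs no Java exception handling.
            return new String[] { "ERROR:" + e.getMessage() };
        }
    }
}
```

On the native side, a method like this would typically be looked up with the JNI functions FindClass and GetStaticMethodID and invoked with CallStaticObjectMethod. Returning errors as an "ERROR:" line is only one possible convention, chosen here to keep changes to the C++ code small.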

asad.ali · December 21, 2021, 5:13pm

@paulpre

Thanks for elaborating on the scenario. We have logged an investigation ticket as PDFJAVA-41162 in our issue management system to further analyze this case and your complete requirements. We will check the feasibility of offering text coordinate values along with the Pure formatting option, or of offering Pure formatting for extraction in the TextFragmentAbsorber class. We will let you know as soon as we have an update on the ticket's resolution. Please be patient and give us some time.

We are sorry for the inconvenience.

paulpre · December 21, 2021, 5:16pm

Oh, very nice, thank you for this. We will be waiting for it.