Hello
The code where I use TextDevice only produces a text file, and I could not find how to get the coordinates associated with the text. Is there a way to obtain the text coordinates with this approach? I do not have the sources of your classes…
I posted a screenshot of the resulting text file above; here it is again: image.png
Here is the code:
// Append the plain text of every page to a single output file
try (java.io.OutputStream text_stream = new java.io.FileOutputStream(_LogDir + "/ExtractedText_Pure.txt", false)) {
    for (Page page : _PDFDocument.getPages()) {
        TextDevice textDevice = new TextDevice();
        // "Pure" formatting mode: text only, no coordinates in the output
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.setExtractionOptions(textExtOptions);
        textDevice.process(page, text_stream);
    }
}
By reconstructing the PDF, yes, I mean that we generate a new PDF document from scratch, but not with Aspose.PDF: we use a very old Adobe library (which we would like to replace with Aspose). We bought your global license; we are the company Everteam and have a regular paid Aspose account. Sorry for using my personal account here.
The application (a set of separate programs) processes documents as follows:
1- open a “spool” PDF file (contains printed documents sent to users like invoices, …)
2- enumerate all information: text, images, annotations, fonts, …
3- build a text file (like the one I sent yesterday above: page1.zip) that contains the info, the text, and the coordinates, with lines like “TO2:408.05:638.25:463.45:638.25:Current Month”
4- save this text file
5- index text in a database
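To make the line format in step 3- concrete, here is a minimal sketch in Java of how one such line could be written. It is illustrative only and not our production code: the class name TextRecordLine, the parameter names, and the reading of the four numbers as two (x, y) points are assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical helper: builds one "TAG:x1:y1:x2:y2:text" line.
// The meaning given to the four numbers (two points) is an assumption.
final class TextRecordLine {
    private TextRecordLine() {}

    static String format(String tag, double x1, double y1,
                         double x2, double y2, String text) {
        // Locale.ROOT forces '.' as the decimal separator, whatever the
        // default locale of the machine running the extraction is.
        return String.format(Locale.ROOT, "%s:%.2f:%.2f:%.2f:%.2f:%s",
                tag, x1, y1, x2, y2, text);
    }
}
```

Calling TextRecordLine.format("TO2", 408.05, 638.25, 463.45, 638.25, "Current Month") gives the sample line quoted in step 3-.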
Then, when a user runs a search on the indexed text, we rebuild a document like the original. Our server has to:
a- create a new PDF empty document
b- add the right background image on each page of the document to rebuild
c- add the images and text, using the fonts and coordinates stored in the saved text file
d- return the document to the user in the HTTP response stream
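For step c-, the saved lines have to be read back into their parts. The sketch below shows one possible way to do that in Java; it is only an illustration, since our real programs are in C++. The class name ParsedTextRecord, the field names, and the interpretation of the four numbers as two (x, y) points are assumptions.

```java
// Hypothetical value class for one "TAG:x1:y1:x2:y2:text" line.
final class ParsedTextRecord {
    final String tag;
    final double x1, y1, x2, y2;
    final String text;

    private ParsedTextRecord(String tag, double x1, double y1,
                             double x2, double y2, String text) {
        this.tag = tag;
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
        this.text = text;
    }

    static ParsedTextRecord parse(String line) {
        // Limit the split to 6 parts so that any ':' inside the text
        // itself stays in the last field.
        String[] parts = line.split(":", 6);
        if (parts.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + line);
        }
        return new ParsedTextRecord(
                parts[0],
                Double.parseDouble(parts[1]),
                Double.parseDouble(parts[2]),
                Double.parseDouble(parts[3]),
                Double.parseDouble(parts[4]),
                parts[5]);
    }
}
```

Splitting with a limit is a deliberate choice here: printed documents often contain text such as times or labels with colons, and those must not be cut into extra fields.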
The problem is that the Adobe library we use for steps 2- and 3- is too old (more than 20 years) and does not always fully recognize the data in recent PDF files.
Originally, all the programs were written in C++ with Visual C++ 6. Through our paid account I asked whether your C++ SDK is usable with Visual C++ 6, but the answer was no…
So the solution I found is to modify our C++ OCX responsible for steps 2- and 3- so that it calls a Java class method through JNI.
That is done, so the only remaining problem is for the Java class using Aspose to obtain the coordinates of every piece of text, as the old Adobe library does and as shown in the text file I uploaded one or two days ago.
So we only need Aspose for steps 2- and 3-: the less we touch the C++ programs, the better.