OCR order of text when adding text to a PDF

rmachincsd · August 23, 2022, 4:21pm

Using Aspose.PDF for Java (version 19.3), I would like to add some text stamps to a previously generated PDF file and then get the text back in the order it is displayed, regardless of which PDF viewer is used.

That is, no matter which API is used to stamp text into the PDF file (PdfFileMend, PdfFileStamp, TextStamp, TextFragment, or PdfViewer), when I search for the text in the generated file, it is selected in the order it was inserted and not in the order it is displayed.
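To make the difference between the two orders concrete, here is a minimal plain-Java sketch. It does not use Aspose.PDF or any PDF library: the `ReadingOrder` class, the `Fragment` type, and the coordinates are all hypothetical and exist only for illustration. It assumes each piece of text has a position in PDF user space (origin at the bottom-left, y growing upward) and that "display order" means top to bottom, then left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: not part of Aspose.PDF or any other PDF library.
final class ReadingOrder {

    // A piece of text and the position of its origin in PDF user space
    // (origin at the bottom-left of the page, y grows upward).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments in the order they appear in the list,
    // i.e. the order in which they were written to the page.
    static String inInsertionOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Joins the fragments top to bottom (larger y first), then left to right.
    // Ties are resolved on exact coordinates only; real pages would need a
    // tolerance to group fragments that sit on the same line.
    static String inDisplayOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        return inInsertionOrder(sorted);
    }

    public static void main(String[] args) {
        // The stamp is added last but positioned at the top of the page.
        List<Fragment> page = Arrays.asList(
                new Fragment("Body", 72, 600),
                new Fragment("Footer", 72, 50),
                new Fragment("STAMP", 72, 750));

        System.out.println(inInsertionOrder(page)); // Body Footer STAMP
        System.out.println(inDisplayOrder(page));   // STAMP Body Footer
    }
}
```

The first line printed corresponds to what I am seeing (the stamp comes last because it was added last), and the second line is what I would like to get.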

In particular, I’m creating the first PDF using Aspose.Cells for Java and then opening that PDF from a byte array or a temporary file; the result is the same either way.

I attached an example project with different variants. The only approach I found that works is TextStamp with textStamp.setBackground(true), and only in the case where the stamp is the first text on the page.

Screen Shot 2022-08-23 at 13.10.18.png (15.4 KB)
aspose-pdf-ocr.zip (246.9 KB)

asad.ali · August 23, 2022, 8:34pm

@rmachincsd

Are you trying to find existing content in the PDF or the watermark text? Also, please share the code snippet that you are using to find/extract the text so that we can further test the scenario in our environment and address it accordingly.

rmachincsd · August 23, 2022, 10:58pm

The example project uses Apache PDFBox to extract the text, but you could also use Chrome, Firefox, or some other native viewer with its find (Ctrl+F) feature.

asad.ali · August 24, 2022, 12:10pm

@rmachincsd

We are testing the scenario in our environment and will get back to you shortly.

asad.ali · April 6, 2023, 8:29pm

@rmachincsd

Would you please try using the TextAbsorber class with version 23.3 of the API to extract the text from an existing PDF? It extracts the text in the same order in which it is displayed in the PDF. If you are still unable to achieve what you require, please let us know.