Hi, we have massive problems with some special(?) PDFs. Please find one example PDF attached. We are using Aspose.PDF 22.1 for Java.
We use TessCallBackGetHocr() for OCR. For the attached file (31 pages), the invoke method is called more than 40,000 (!!) times. Most of the images look like this:
Image Nr.7:6x1 px
Image Nr.8:6x1 px
Image Nr.9:6x1 px
Image Nr.10:6x1 px
Image Nr.11:14x1 px
Image Nr.12:6x1 px
It makes no sense to pass images like these to the OCR engine (Tess4J), so we filter out images below a specified threshold.
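To make the filtering concrete, here is a minimal sketch of the kind of size check we mean. The class name ImageSizeFilter and the two threshold values are only assumptions for illustration, they are not part of Aspose.PDF or Tess4J, and our real thresholds may differ.

```java
import java.awt.image.BufferedImage;

/**
 * Hypothetical helper (not part of Aspose.PDF or Tess4J) that decides
 * whether an image is large enough to be worth sending to the OCR engine.
 */
final class ImageSizeFilter {

    // Assumed thresholds for illustration only; tune them for your documents.
    static final int MIN_WIDTH_PX = 20;
    static final int MIN_HEIGHT_PX = 8;

    private ImageSizeFilter() {
        // static helper, no instances
    }

    /** Returns true only if the image meets both minimum dimensions. */
    static boolean isWorthOcr(BufferedImage img) {
        return img != null
                && img.getWidth() >= MIN_WIDTH_PX
                && img.getHeight() >= MIN_HEIGHT_PX;
    }
}
```

Inside the invoke method, the idea is to call ImageSizeFilter.isWorthOcr(img) first and only hand the image to Tess4J when it returns true. With these assumed thresholds, all of the 6x1 and 14x1 images listed above would be skipped because their height is 1 px.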
My questions are:
1.) Is there a way to check the number of images that will reach the invoke method BEFORE we try to OCR a PDF? (If I check each page beforehand with imagecollection = resources.getImages(), I don't get 40,000+ images.) This would be a good way to filter out such PDFs in advance.
2.) Our filtering in the invoke() method returns some standard HTML (see below). This is also not a good solution for 40,000 images. Is there an alternative? (If I return an empty string or only a " ", Aspose throws an exception.) Here is the standard HTML that works:
2022-04-08 18_01_35-2Charta-Converter – TessCallBackGetHocr.java.png (50.2 KB)
Thanks for your support.
Regards, Gerd
1014008_0100020000000025_OvercomingObjections_V8.pdf (2.6 MB)