Problems with OCR of some PDFs

GRein · April 8, 2022, 4:00pm

Hi, we have massive problems with some special(?) PDFs. Please find one example PDF attached. We are using Aspose.PDF 22.1 for Java.
We are using TessCallBackGetHocr() for OCR. For the attached file (31 pages), the invoke method is called more than 40,000 (!!) times. Most of the images look like this:
Image Nr.7:6x1 px
Image Nr.8:6x1 px
Image Nr.9:6x1 px
Image Nr.10:6x1 px
Image Nr.11:14x1 px
Image Nr.12:6x1 px
It makes no sense to pass these to the OCR engine (= Tess4J), which is why we filter out images below a specified threshold.
My questions are:
1.) Is there a way to check the number of images that will reach the invoke method BEFORE we try to OCR a PDF? (If I check each page beforehand with imagecollection = resources.getImages(), I don't get 40,000+ images.) This would be a good way to filter out such PDFs in advance.
2.) Our filtering in the invoke() method returns some standard HTML (see below). This is also not a good solution for 40,000 images. Is there any alternative? (If I return an empty string or only a " ", Aspose throws an exception.) Here is the standard HTML that works:

2022-04-08 18_01_35-2Charta-Converter – TessCallBackGetHocr.java.png (50.2 KB)
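
For readers who cannot open the screenshot: the exact string used here is only visible in the attachment and is not reproduced below. As a separate, hypothetical illustration, an "empty" hOCR result in the general shape that Tesseract emits could be built as in the following sketch. The class and method names (EmptyHocr, emptyPage) are made up for this example, and whether Aspose.PDF accepts exactly this markup is an assumption that has not been verified.

```java
/**
 * Hypothetical helper, not part of Aspose.PDF or Tess4J.
 * Builds an hOCR document that contains one page and no recognized text.
 */
public class EmptyHocr {

    /** Returns hOCR markup for an empty page of the given pixel size. */
    public static String emptyPage(int width, int height) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<html xmlns=\"http://www.w3.org/1999/xhtml\">\n");
        sb.append(" <head>\n");
        sb.append("  <title></title>\n");
        sb.append("  <meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\"/>\n");
        sb.append("  <meta name='ocr-system' content='none'/>\n");
        sb.append(" </head>\n");
        sb.append(" <body>\n");
        // One ocr_page element whose bounding box covers the whole image,
        // with no lines or words inside it.
        sb.append("  <div class='ocr_page' id='page_1' title='bbox 0 0 ")
          .append(width).append(' ').append(height)
          .append("; ppageno 0'></div>\n");
        sb.append(" </body>\n");
        sb.append("</html>\n");
        return sb.toString();
    }
}
```

Building the string once and reusing it, as the defaultOcr field in the callback does, avoids rebuilding the markup for each of the 40,000 images.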

Thanks for your support,
regards, Gerd

1014008_0100020000000025_OvercomingObjections_V8.pdf (2.6 MB)

asad.ali · April 8, 2022, 7:44pm

@GRein

Regretfully, there is no way to check the number of images passed to the invoke method before the OCR operation. We also need to investigate whether there is any alternative for this case. Could you please share the complete sample code snippet that you are using on your side to carry out this functionality? We will log an investigation ticket and share the ticket ID with you.

GRein · April 11, 2022, 7:09am

Hello,
thank you for your response. You already have the PDF. It is only one of many that show the same problem.
Here is the code. This is the calling method:

doc = new Document(pdfInput);

            callback = new TessCallBackGetHocr();
            callback.setCancelAdapter(this.cancelAdapter);
            try
            {
                doc.convert(callback, false, true);
            }
            catch (Exception e)
            {
                // This was either an exception caused by isCanceled or one with another cause
                if (!isCanceled)
                {
                    // An exception with another cause, so we rethrow it right away.
                    throw e;
                }
            }

and here is the TessCallBack:

public class TessCallBackGetHocr implements CallBackGetHocr
{
private Tesseract tesseract;
private int minSizeThreshold = 50; // Minimum width/height an image should have
private CancelAdapter cancelAdapter = new CancelAdapter(); // So that it is never null.
private final String defaultOcr = (see previously attached file)

private int imageNr = 0;

public TessCallBackGetHocr()
{
    tesseract = new Tesseract();
    tesseract.setDatapath(Config.getInstance().getOCR_TessData());
    tesseract.setLanguage("deu+eng");
    // For the meaning of the following 2 constants, see: http://tess4j.sourceforge.net/docs/docs-4.4/constant-values.html
    tesseract.setPageSegMode(TessPageSegMode.PSM_AUTO);
    tesseract.setOcrEngineMode(TessOcrEngineMode.OEM_TESSERACT_ONLY);
    // tesseract.setTessVariable("user_defined_dpi", "" + 300);
    tesseract.setHocr(true);
}

/**
 * Implementation of the hOCR callback method for Aspose.
 */
@Override
public String invoke(java.awt.image.BufferedImage img)
{
    imageNr++;
    System.out.println("Image Nr." + imageNr + ":" +  img.getWidth() + "x" + img.getHeight());
    String ocrresult = defaultOcr;
    if (this.cancelAdapter.isCanceled())
    {
        // Then we force an exception. In this case Aspose throws an IndexOutOfBoundsException
        return "";
    }
    else if (img.getWidth() < minSizeThreshold || img.getHeight() < minSizeThreshold)
    {
        // If either dimension of the image is below the threshold, the OCR cannot recognize anything.
        // So we act as if the OCR had found nothing.
        return defaultOcr;
    }
    else
    {
        try
        {
            // Otherwise run the normal Tesseract OCR
            ocrresult = tesseract.doOCR(img);
        }
        catch (Exception e)
        {
            // OCR failed; ocrresult keeps its initial value (defaultOcr)
        }
    }
    return ocrresult;
}

public void setCancelAdapter(CancelAdapter adapter)
{
    if (adapter != null)
    {
        this.cancelAdapter = adapter;
    }
}

}
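
The size check in invoke() can also be pulled out into a small helper that counts how many images were seen and skipped, which makes it easier to spot PDFs like the attached one after a run. The following is only a sketch: OcrImageFilter and its methods are hypothetical names and not part of Aspose.PDF or Tess4J. It uses the same rule as the callback above (skip when either dimension is below the threshold).

```java
import java.awt.image.BufferedImage;

/**
 * Hypothetical helper, not part of Aspose.PDF or Tess4J.
 * Decides whether an image is large enough to be worth OCRing
 * and keeps simple counters for diagnostics.
 */
public class OcrImageFilter {
    private final int minSize; // minimum width and height in pixels
    private int seen = 0;      // images passed to shouldOcr so far
    private int skipped = 0;   // images rejected as too small

    public OcrImageFilter(int minSize) {
        this.minSize = minSize;
    }

    /** Returns true only when both width and height reach the threshold. */
    public boolean shouldOcr(BufferedImage img) {
        seen++;
        boolean bigEnough = img.getWidth() >= minSize && img.getHeight() >= minSize;
        if (!bigEnough) {
            skipped++;
        }
        return bigEnough;
    }

    public int getSeen() {
        return seen;
    }

    public int getSkipped() {
        return skipped;
    }
}
```

Inside invoke(), the else-if branch would then become a call such as !filter.shouldOcr(img), and the counters can be logged once after doc.convert(...) returns instead of printing a line per image.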

asad.ali · April 11, 2022, 2:28pm

@GRein

We have logged an investigation ticket as PDFJAVA-41506 in our issue management system in order to analyze your requirements. We will look into the details of the scenario and let you know as soon as definite updates are available regarding the ticket resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.