Poor quality OCR versus Tesseract

frimbingpickering · April 28, 2023, 2:49pm

Hi there!

We use Aspose.OCR in conjunction with Aspose.Pdf to convert a TIF file into a searchable PDF. Attached you will find an example of the TIF files.

We are a bit disappointed with the OCR quality, especially compared to the OCR quality of (the free) Tesseract. Attached you will find two PDFs, one created by Aspose and the other by Tesseract, so that you can assess the differences firsthand.

An indication of how we use Aspose is as follows.

For the attached examples, we’ve used the German language model.

// Configure recognition with the German language model.
Settings = new RecognitionSettings();
Settings.Language = Aspose.OCR.Language.Deu;

// Recognizes the given image and returns the result as a PDF stream.
public MemoryStream RecognizeImage(MemoryStream image)
{
    var pdf = new MemoryStream();
    var result = engine.RecognizeImage(image, Settings);
    result.Save(pdf, SaveFormat.Pdf);
    return pdf;
}

Can you say something about the differences? Do you utilize Tesseract under the hood yourselves?

Input and output.zip (2.4 MB)

asad.ali · April 28, 2023, 11:00pm

@frimbingpickering

We have opened the following investigation ticket in our issue management system to analyze this case further.

Issue ID(s): OCRNET-664

We will look into the details and keep you posted on the status of the ticket. Please give us some time.

We are sorry for the inconvenience.

frimbingpickering · May 5, 2023, 9:08am

@asad.ali, thank you. Any news yet? Can you give us an indication of when we will receive a response?

asad.ali · May 5, 2023, 2:05pm

@frimbingpickering

Unfortunately, the ticket has not been investigated yet. We investigate tickets on a first-come, first-served basis, and as soon as we make progress toward a resolution, we will update you in this forum thread. Please give us some time.

We apologize for the inconvenience.

frimbingpickering · May 16, 2023, 12:13pm

Okay, understood, we’ll wait and see.

Maybe while you investigate this, you can already answer the following question:

Do you use Tesseract under the hood, or do you use an engine of your own? And if so, is it based on Tesseract?

Thank you!

asad.ali · May 16, 2023, 9:10pm

@frimbingpickering

No, we don’t use Tesseract in any way. We use our own neural network. Aspose.OCR Cloud uses models that we developed ourselves, and we use the same models in the downloadable versions. These models cover all of the following features:

text recognition
detection of areas containing text
dewarping of curved text and noise removal
Cyrillic, Latin, Hindi and Chinese models

asad.ali · May 29, 2023, 7:15pm

@frimbingpickering

Could you please try this mode:

// imgPath is the path to the input TIFF file.
var api = new AsposeOcr();
OcrInput input = new OcrInput(InputType.TIFF);
input.Add(imgPath);
var result = api.Recognize(input, new RecognitionSettings
{
    DetectAreasMode = DetectAreasMode.PHOTO
});
Console.WriteLine(result[0].RecognitionText);
result[0].Save("res664.pdf", SaveFormat.Pdf);

The results: compare.zip (923.8 KB)

frimbingpickering · May 30, 2023, 11:21am

Sure, that looks better!

But why? According to the docs, DetectAreasMode.PHOTO is “better for image with a lot of pictures and other not text objects” and I think that this example image is not such an image.

How do I know which DetectAreasMode to choose? How can I advise my customers on this?

asad.ali · May 31, 2023, 10:59am

@frimbingpickering

We have recorded your feedback under the logged ticket and will let you know once we have investigated it. Please give us a little time.

asad.ali · May 31, 2023, 11:13am

@frimbingpickering

For images with tables and lines, we advise using PHOTO or TABLE mode. For distorted text lines (where the lines of text curve), we advise using CURVED_TEXT.

DOCUMENT mode is helpful only with structured text, such as scanned contracts, book pages, etc.
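
The advice above could be captured in a small lookup helper. This is only a rough sketch under stated assumptions: `AreasMode` is a local stand-in for the library's `DetectAreasMode` enum (so the snippet compiles without the Aspose.OCR package), and `PageKind` and `ModeAdvisor` are hypothetical names that are not part of Aspose.OCR.

```csharp
using System;

// Hypothetical stand-in for the library's DetectAreasMode enum.
// The member names are the four modes named in this thread.
public enum AreasMode { DOCUMENT, PHOTO, TABLE, CURVED_TEXT }

// Hypothetical classification of the input image; not part of Aspose.OCR.
public enum PageKind { StructuredText, TablesOrLines, CurvedLines }

public static class ModeAdvisor
{
    // Maps a rough description of the image to the mode advised in this thread.
    public static AreasMode Suggest(PageKind kind)
    {
        switch (kind)
        {
            case PageKind.TablesOrLines:
                // The thread advises PHOTO or TABLE here; PHOTO is the mode
                // that improved the sample image, so it is returned first.
                return AreasMode.PHOTO;
            case PageKind.CurvedLines:
                return AreasMode.CURVED_TEXT;
            case PageKind.StructuredText:
                // Scanned contracts, book pages and similar.
                return AreasMode.DOCUMENT;
            default:
                throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }
}
```

In real code, the suggested value would be translated to the matching `DetectAreasMode` member and assigned to `RecognitionSettings.DetectAreasMode`, as in the snippet earlier in this thread. If PHOTO does not help on an image with tables, TABLE is the other mode worth trying.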