Extract text from image - Language=German - Not 100% correct

Hello there,
First of all: Aspose OCR works a lot faster than the other products I have tested. That is good news.
However, some words are not recognized very well.
For example, Aspose extracts:
handle but it should be handelt
folaenden but it should be folgenden
sonddrn but it should be sondern
Särmtliche but it should be Sämtliche
archivierl but it should be archiviert
zuageordnet but it should be zugeordnet
Posteinaana but it should be Posteingang

Do you have an explanation for this?
Warm regards, Gerd
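A rough way to put a number on misrecognitions like the ones listed above is the character error rate: the edit distance between the recognized and the expected word, divided by the length of the expected word. The following is a minimal sketch in plain Java, not part of Aspose.OCR; the class and method names (`OcrErrorRate`, `levenshtein`, `characterErrorRate`) are made up for illustration, and only some of the word pairs from the post are used.

```java
class OcrErrorRate {

    /** Levenshtein distance: minimum number of single-character edits turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Character error rate: total edits divided by total expected characters. */
    static double characterErrorRate(String[][] pairs) {
        int edits = 0;
        int expectedChars = 0;
        for (String[] pair : pairs) {
            edits += levenshtein(pair[0], pair[1]);
            expectedChars += pair[1].length();
        }
        return expectedChars == 0 ? 0.0 : (double) edits / expectedChars;
    }

    public static void main(String[] args) {
        // {recognized, expected} pairs taken from the post above
        String[][] pairs = {
            {"handle", "handelt"},
            {"folaenden", "folgenden"},
            {"sonddrn", "sondern"},
            {"archivierl", "archiviert"},
            {"zuageordnet", "zugeordnet"},
            {"Posteinaana", "Posteingang"},
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " -> " + pair[1] + ": " + levenshtein(pair[0], pair[1]));
        }
        System.out.printf("Character error rate: %.3f%n", characterErrorRate(pairs));
    }
}
```

Running the same measurement against the output of two library versions gives a simple before/after comparison for the same image.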

@GerdRein

Could you please share a sample image with us so that we can test the scenario in our environment and address it accordingly? Also, please confirm on which platform you are using the API, e.g. .NET or Java.

Hello, yes, I can. Please look at the attachment. This product is really good. If I can help to make it even better, please let me know. At the moment, though, we cannot use it because there are too many errors. I hope we can fix this.
Bildschirmfoto3.png (318.4 KB)

@GerdRein

Thanks for sharing the sample file.

Could you please also provide the platform information requested above? We will then proceed to assist you accordingly.

macOS 11.1 / Java

@GerdRein

Support for European languages is not yet implemented in the API; it is a work in progress. We intend to provide support for Chinese as well as European language recognition this year. Furthermore, a ticket, OCRJAVA-103, has been logged in our issue tracking system for your case. We will investigate it further and let you know as soon as it is resolved. Please give us some time.

We apologize for the inconvenience.

Dear Asad, no problem, and don’t worry too much about this. I am convinced that you will succeed. I will be watching the progress. If I can contribute something, please let me know. I would be glad to contribute, because I am retired and I appreciate your product.

@GerdRein

Thanks for your kind feedback.

The task of adding support for the languages in question is ongoing, and we will update this forum thread once it is completed. If you have suggestions or other requirements, you can post them in this thread as well. We will consider them and include them in the ongoing work.

@GerdRein

We have made improvements in version 21.2, which will be published next week. Recognition results should be better in that version.

Hi, before I start a test: did you make significant improvements, especially for German umlauts?

@GerdRein

We are checking the related information at our end and will get back to you shortly.

@GerdRein

We would like to let you know that we have made some improvements in version 21.2 of the API. You can achieve the attached results using this version. All diacritical symbols are recognized.

result.zip (937 Bytes)

Thank you! It looks good, I will try it!

Sorry, I cannot find the JAR file for OCR 21.2.
Where can I get it?
… I found it, thanks!

Well, I tested a little bit and the result is really very much better, but not perfect.
Some problems arise with the letter M. I don’t know why, but under some circumstances the OCR process doesn’t handle it well. Sometimes it generates Ml or something similar.
Also, something is wrong with the recognition of the " character: “” is always generated instead (though this is not a big problem). Sometimes, when there is not enough white space around the text, the OCR cuts it off. Sometimes the first line is not recognized. I am attaching some pictures so you can verify this.
But in the meantime, thank you for this version! It is nearly perfect.
testocr3.JPG (178.6 KB)
testocr4.JPG (473.9 KB)
testocr5.JPG (31.8 KB)
testocr6.JPG (353.2 KB)
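Until the doubled quotation marks are fixed in the library, they can be cleaned up after recognition. The sketch below is plain Java string handling, not an Aspose.OCR feature. It assumes the doubled output is exactly the character pair U+201C U+201D, as it appears in the post above (the forum software may have altered the characters), and the names `QuoteCleanup` and `collapseDoubledQuotes` are hypothetical.

```java
class QuoteCleanup {

    /**
     * Replaces each adjacent pair of curly quotes (U+201C followed by U+201D)
     * with a single straight double quote. A lone curly quote is left untouched.
     */
    static String collapseDoubledQuotes(String recognizedText) {
        return recognizedText.replace("\u201C\u201D", "\"");
    }

    public static void main(String[] args) {
        // Simulated OCR output in which every " was recognized as a curly pair
        String ocrOutput = "Er sagte \u201C\u201DHallo\u201C\u201D.";
        System.out.println(collapseDoubledQuotes(ocrOutput));
    }
}
```

Note that a text that legitimately contains an empty quotation written with curly quotes would also be changed, so this is only suitable when the source documents use straight quotes.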

@GerdRein

Thanks for providing your feedback. We have updated the ticket with this information and will consider fixing these issues as well. We will inform you once additional updates are available regarding the ticket resolution.

@GerdRein

In the latest release, we significantly improved our model, and it now gives good recognition results. We have attached the resulting .txt file for the testocr3 image.

Regarding the errors you pointed out:

  • If you notice the first or last line being cut off, or any lines being dropped, we recommend calling setDetectAreas(false). This gives an excellent result in such cases.
  • We did not notice any errors with the letter M. Unfortunately, the doubling of quotation marks is still present. We hope to eliminate these errors in future releases.

Version on which we tested your images: Aspose.Ocr.Java 21.3
Code example:

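import com.aspose.ocr.AsposeOCR;
import com.aspose.ocr.RecognitionResult;
import com.aspose.ocr.RecognitionSettings;

AsposeOCR api = new AsposeOCR();
RecognitionSettings set = new RecognitionSettings();
// Disable automatic area detection so that the first and last lines are not cut off
set.setDetectAreas(false);
RecognitionResult res = api.RecognizePage("testocr3.JPG", set);
System.out.println(res.recognitionText);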

testocr3.zip (648 Bytes)

Thank you! I tested it and it looks really very good!
I have another question: is there any recommendation for preprocessing pictures? Sometimes the quality is not very good. Does Aspose.OCR do any preprocessing internally (black-and-white conversion, noise reduction, etc.)?

@GerdRein

At the moment, Aspose.OCR does not support preprocessing of the images. However, we will surely investigate the feasibility and let you know in this forum thread.

@GerdRein

You can apply skew correction and adjust the threshold value to obtain better recognition results with Aspose.OCR. Please check the following sample code snippet:

AsposeOcr api = new AsposeOcr();
var res = api.RecognizeImage(imgPath,
    new RecognitionSettings
    {
        AutoSkew = true, // the default value; set it to false to disable skew correction
        ThresholdValue = 230 // set the threshold yourself; if you don't, it is calculated automatically
    });
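
The snippet above uses C# object-initializer syntax, while the question was asked for Java. Independently of the library, a threshold-based black-and-white conversion can be tried on the Java side with the JDK alone before the image is passed to the OCR engine. This is only a sketch of global thresholding; it is not necessarily what ThresholdValue does internally, and the names `Binarizer` and `binarize` are hypothetical. The value 230 is taken from the snippet above.

```java
import java.awt.image.BufferedImage;

class Binarizer {

    /**
     * Global thresholding: pixels whose luminance is at or above the threshold
     * become white, all others become black.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the ITU-R BT.601 luma weights
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                result.setRGB(x, y, luma >= threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny synthetic image: one white pixel and one mid-gray pixel
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF);
        image.setRGB(1, 0, 0xC8C8C8);
        BufferedImage bw = binarize(image, 230);
        System.out.printf("%06X %06X%n", bw.getRGB(0, 0) & 0xFFFFFF, bw.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

A real scan would be loaded with javax.imageio.ImageIO.read and written back with ImageIO.write. A high threshold such as 230 turns light-gray backgrounds black as well, so the value usually needs to be tuned per image source.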