OCR of numbers fails to recognize sequences of 3 or more identical digits, drops digits

When using OCR on TIFF files that contain numbers, the OCR tends to drop digits when it encounters several identical digits in a row.

Examples (on TIFF → recognized):

  • “3.333,33 €” → “3.33,33 e”
  • 144266 → 14266
  • 444508 → 44508
  • CDA200360000216X → CDA20036000216X

This is a severe problem in

  • accounting applications → sums of money
  • insurance applications → insurance numbers

I assume this is a systematic error.

Please fix!

We use Aspose.OCR for .NET 26.3.0

Another error, which we encountered less often, adds letters:

  • PXP139069X → PXP139069Xx
  • PXP146709X → PXP3146709X
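
To check a batch of results for these two error patterns, a minimal sketch in Python may help. It does not call Aspose.OCR at all: it only compares an expected string with the recognized one. The function name `classify_ocr_diff` and its labels are made up for illustration, and it handles only a single dropped or a single inserted character.

```python
def classify_ocr_diff(expected: str, recognized: str) -> str:
    """Classify a one-character discrepancy between expected and recognized text.

    Hypothetical helper for illustration; labels are arbitrary.
    """
    if expected == recognized:
        return "match"

    # Case 1: exactly one character is missing from the recognized text.
    if len(recognized) == len(expected) - 1:
        for i in range(len(expected)):
            if expected[:i] + expected[i + 1:] == recognized:
                # The dropped character belongs to a run if a direct
                # neighbour in the expected text is the same character.
                same_left = i > 0 and expected[i - 1] == expected[i]
                same_right = i + 1 < len(expected) and expected[i + 1] == expected[i]
                return "dropped-from-run" if (same_left or same_right) else "dropped"

    # Case 2: exactly one extra character appears in the recognized text.
    if len(recognized) == len(expected) + 1:
        for i in range(len(recognized)):
            if recognized[:i] + recognized[i + 1:] == expected:
                return "inserted"

    return "other"


if __name__ == "__main__":
    # Pairs taken from the examples in this post.
    samples = [
        ("144266", "14266"),
        ("444508", "44508"),
        ("CDA200360000216X", "CDA20036000216X"),
        ("PXP139069X", "PXP139069Xx"),
        ("PXP146709X", "PXP3146709X"),
    ]
    for expected, recognized in samples:
        print(expected, "->", recognized, ":", classify_ocr_diff(expected, recognized))
```

With the pairs listed above, the first three fall into the "dropped-from-run" class and the last two into "inserted", which matches the two kinds of error described in this post.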

Please provide the image you are trying to recognize. For now, I am testing on similar examples.

@JoesterA
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-1204

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hello Anna,

I want to provide these images.

However, the images I ran OCR on contain confidential customer data. When I white out the confidential parts of the image, for example a customer's address, errors 2 to 4 simply vanish. These errors occur when recognizing the subject line of a letter, which directly follows the address. So the address seems to lay the foundation for the bad recognition of the subject line.

The only error that is still reproducible after clearing out the confidential parts is error number 1. Here, the complex table before the text “3.333,33 €” seems to prime the recognition to actually produce “3.33,33e”.

I attach it as file “sample1.tif” in Sample1.zip:

Sample1.zip (247,3 KB)

Regards
Axel Joester

If I do not add the preprocessing filter “AutoSkew”, the result changes to “3.333,3 6”.

Sorry for the delay in responding. I am working on improvements to reduce the errors that appear in your image, and I will provide you with an example of the result.

I have made some improvements to the recognition process so far. There are still minor errors, so please review the result. These changes will be available in the May release.
test.zip (2.3 KB)

Thank you, Anna!

I compared your result to my results from the same TIFF.

The words “Gesamtsaldo 3.333,33 e” used to be on separate lines in my recognition results.

In your results, they are lumped into one line.

I find the same problem in other data from the table:
The values

  • Datum
  • Zahlung
  • Saldo
  • Zinstage
  • Zinssatz
  • Zins Anmerkung

used to be in separate lines in my results. They are in one line in your result.

This change would be a big problem in my use case!

But maybe there is really no change here, and we just use different RecognitionSettings?

I use default settings, with the exception of

  • Language = Language.Deu

So my complete recognition settings are:

  • AllowedCharacters = Aspose.OCR.CharactersAllowedType.ALL
  • AllowedSymbols = null
  • AutomaticColorInversion = true
  • DetectAreasMode = UNIVERSAL
  • IgnoredSymbols = null
  • Language = Deu
  • LanguageDetectionLevel = ByParagraph
  • LinesFiltration = false
  • RecognizeSingleLine = false
  • RecognizeVerticalLines = false
  • ThreadsCount = 0
  • UpscaleSmallFont = false

Regards,
Axel

I tested the latest release with exactly the same settings as yours, and I still always get the words you specified on one line.
There is an option to get the text separately by lines (in the case of a table, by cells): use RecognitionLinesResult instead of RecognitionText.
1204_doubling_charact.zip (272.5 KB)

Hello Anna,

Thank you, I didn’t realize that! I do actually read from RecognitionLinesResult. You read from RecognitionText. This explains our differing results.

Then my test is positive!

Regards,
Axel

Thanks for the reply. If there are no further comments, please wait for the May release.