Problems trying to read all text from image

aphonenumber · March 10, 2017, 3:54pm

Problem: Can't get expected results from OCR text recognition.

I am a licensed user. I have successfully converted a PDF to a 300 dpi JPG, which is the attached 'dpi300.jpg' file.

All tests are being run on this 'dpi300.jpg' file.

ATTEMPT #1:

First I tried something very basic (see the ocr1.txt attachment for source code and results).

(The code is at the top. The results are below the "================================" line.)

This first result has *two major problems*:

First, the text does not appear in the correct order. Notice how "Socket Programming HOWTO" and "Guido van Rossum", which are near the top of the image, appear in the middle of the table of contents.

Second, the entire 'Abstract' section, which appears within a rectangular border in the image, does not appear in the result.

ATTEMPT #2:

With 'DetectTextRegions = false' set, the text appears in the correct order (see the ocr2.txt attachment).

This second result still has two problems:

First, the entire 'Abstract' section is still missing.

Second, the text "Author Gordon McMillan" is now missing.

ATTEMPT #3:

I tried a config option called 'DeleteTableLines' (see the ocr3.txt attachment).

Now the 'Abstract' section and "Author Gordon McMillan" both appear, but a new problem is introduced:

Some of the text is now removed incorrectly. For example, the first line reads "ocket Programming" instead of "Socket Programming HOWTO".

How can I successfully read ALL of the text from this image, even if it's inside a rectangle?

(It would be nice if the dots "................" would not be recognized as "owelywompeopywyeep", but that is less important.)
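
To compare the three attempts without reading each result by eye, a small check along these lines might help. This is only a sketch in Python, independent of the OCR library: the function name `check_ocr_text` is made up for illustration, the phrase order in `expected` is my assumption about the page layout, and `sample_result` is a stand-in string that imitates the attempt #3 symptom, not the actual content of ocr3.txt.

```python
from typing import List, Tuple


def check_ocr_text(text: str, expected: List[str]) -> Tuple[List[str], bool]:
    """Return (missing phrases, True if the found phrases occur in the given order)."""
    positions = []
    missing = []
    for phrase in expected:
        index = text.find(phrase)  # first occurrence, or -1 if absent
        if index == -1:
            missing.append(phrase)
        else:
            positions.append(index)
    # Phrases that were found should appear at increasing offsets.
    in_order = positions == sorted(positions)
    return missing, in_order


# Phrases mentioned in this post; their order here is an assumed reading order.
expected = [
    "Socket Programming HOWTO",
    "Guido van Rossum",
    "Author Gordon McMillan",
    "Abstract",
]

# Illustrative stand-in for OCR output, NOT the real attachment contents.
sample_result = "ocket Programming\nGuido van Rossum\nAuthor Gordon McMillan\nAbstract"

missing, in_order = check_ocr_text(sample_result, expected)
print("Missing phrases:", missing)
print("Found phrases in expected order:", in_order)
```

Running the same check on the text from each attempt would show at a glance which phrases are lost and whether the order is still wrong.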

Thank you.

ikram.haq · March 13, 2017, 7:09am

Hi,

Thank you for your inquiry and for sharing the details.

We have investigated the issue on our end, and the initial investigation shows that it persists. It has been logged in our system with ID OCRNET-3184. Our product team will look into it further, and we will post their feedback in this thread.

awais.hafeez · March 29, 2018, 5:23am

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.