Problem: Can't get expected results from OCR text recognition.
I am a licensed user. I have successfully converted a PDF to a 300 dpi JPG, which is the attached 'dpi300.jpg' file.
All tests are being run on this 'dpi300.jpg' file.
ATTEMPT #1:
First I tried something very basic. (See the ocr1.txt attachment for source code and results. The code is at the top; the results are below the "================================" line.)
This first result has *two major problems*:
First, the text does not appear in the correct order. Notice how "Socket Programming HOWTO" and "Guido van Rossum", which are near the top of the image, appear in the middle of the table of contents.
Second, the entire 'Abstract' section, which appears within a rectangular border in the image, does not appear in the result.
ATTEMPT #2:
After setting 'DetectTextRegions = false', the text appears in the correct order. (See the ocr2.txt attachment.)
This second result still has two problems:
First, the entire 'Abstract' section is still missing.
Second, the text "Author Gordon McMillan" is now missing.
ATTEMPT #3:
Next I tried a config option called 'DeleteTableLines'. (See the ocr3.txt attachment.)
Now the 'Abstract' section and "Author Gordon McMillan" both appear, but this creates a new problem:
Some of the text is removed incorrectly. For example, the first line reads "ocket Programming" instead of "Socket Programming HOWTO".
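To make the three results easier to compare, a small check like the one below could be run against each ocrN.txt output. This is only a sketch and is not part of the attachments: the function name `check_ocr_text`, the sample string, and the phrase order are assumptions for illustration, and it uses plain Python string handling rather than any OCR library call.

```python
def check_ocr_text(ocr_text, expected_phrases):
    """Return (missing, out_of_order) for phrases expected in reading order."""
    # Collapse whitespace so line breaks in the OCR output do not hide a match.
    normalized = " ".join(ocr_text.split())
    missing = []
    out_of_order = []
    last_pos = -1
    for phrase in expected_phrases:
        pos = normalized.find(phrase)
        if pos == -1:
            # The phrase was not recognized at all (or was truncated).
            missing.append(phrase)
        elif pos < last_pos:
            # The phrase appears earlier than a phrase that should precede it.
            out_of_order.append(phrase)
        else:
            last_pos = pos
    return missing, out_of_order


# Hypothetical sample, NOT the real attachment contents.
sample = "Contents\nSocket Programming HOWTO\nGuido van Rossum\nocket Programming"
# Assumed reading order of phrases; adjust to match the actual image.
expected = ["Socket Programming HOWTO", "Guido van Rossum", "Contents", "Abstract"]

missing, out_of_order = check_ocr_text(sample, expected)
print("missing:", missing)
print("out of order:", out_of_order)
```

Running this on the output of each attempt would show at a glance which phrases were dropped and which landed in the wrong position.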
How can I successfully read ALL of the text from this image, even if it's inside a rectangle?
(It would also be nice if the dots "................" were not recognized as "owelywompeopywyeep", but that is less important.)
Thank you.