PDF OCR is JUNK and Garbage

mreddyNYC · June 22, 2016, 1:54pm

Hi,

The OCR results for our PDF documents are very bad.

Some of the extracted files look like this:

];n n l a [] ili]n][] ] u }{]]]]]]]]-][I i l ly/[//////Ii!][i]g?][][]\[][//[//i/[][][][/ li][][/////i/IF][] -lli][][Ii!][i]g?][][]\[][//[//i/[][][][//[///j/]]][//[///////////l/f/[][][][//////[/][][][][][1]}]][][][///////l/[////////// tli][/////l/i/[//l/[////////////

Please help us ASAP.

Regards,

Michael

ikram.haq · June 22, 2016, 2:16pm

Hi Michael,

Thank you for your inquiry.

Please go through the following online documentation for details on how to perform OCR on PDF documents. If you run into any issue, please send us the sample PDF file you are testing with. We will look into it and update you accordingly.

Performing OCR on PDF Documents

jon_elster_i3intel_com · August 15, 2020, 4:06pm

I’m having the same issue…
What is the solution?
I can read my PDF fine

jon_elster_i3intel_com · August 15, 2020, 6:48pm

How can I attach a screenshot?

asad.ali · August 16, 2020, 7:30pm

@jon_elster_i3intel_com

You can attach the file using the upload button in the post editor when replying in this thread.

jon_elster_i3intel_com · August 16, 2020, 9:46pm

Here’s another debug window2.PNG.jpg (309.7 KB)

Here’s the original image1_out.jpg (2.8 MB)

asad.ali · August 17, 2020, 6:29pm

@jon_elster_i3intel_com

We have logged an issue as OCRNET-251 in our issue tracking system for the incorrect text extraction from the image. We will investigate it in detail and keep you posted on the status of its resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.

jon_elster_i3intel_com · August 24, 2020, 1:26pm

Hi…
What’s the status? I paid for this and a FREE OCR program does it much better!
thanks

asad.ali · August 24, 2020, 7:35pm

@jon_elster_i3intel_com

We are currently analyzing the earlier logged ticket, and as soon as it is fully investigated we will post an update in this forum thread. We highly appreciate your patience in this matter. Please give us some time.

We apologize for your inconvenience.

asad.ali · August 26, 2020, 8:09pm

@jon_elster_i3intel_com

We investigated this issue and found that our “Document Layout Detection” (DSR) algorithm is not designed to detect text regions in such complex tabular layouts. Our main goal was to recognize office documents, contracts, and printed books. We are only just starting to work on more complex document types.

Moreover, the “downloadable” OCR products ship a light version of DSR with lower quality, because of restrictions in the ONNX neural network player. For documents with such complex layouts, the only solution is to manually set up the image regions for text recognition.

Tesseract works well here because its algorithm detects individual characters. That approach is outdated and has many disadvantages in other cases. Nevertheless, we intend to start work on a new OCR engine that will be able to process receipts, forms, lists, tables, and other complex layouts.

We have prepared a code example for this issue that uses manually defined image regions (Rectangles) for text recognition, and in that case it works fine:

using System.Collections.Generic;
using System.Drawing;
using Aspose.OCR;

AsposeOcr api = new AsposeOcr();
string imgPath = "image1_out.jpg";

// Mark up the image: each Rectangle is (x, y, width, height) in pixels
List<Rectangle> rectangles = new List<Rectangle>
{
    new Rectangle(133,  150,  975,  355),
    new Rectangle(1235, 145,  1038, 378),
    new Rectangle(144,  510,  2232, 90),
    new Rectangle(132,  690,  1400, 310),
    new Rectangle(1670, 685,  576,  306),
    new Rectangle(880,  1090, 765,  110),
    new Rectangle(120,  1200, 2304, 168),
    new Rectangle(870,  1880, 815,  106),
    new Rectangle(210,  2000, 2080, 666),
    new Rectangle(210,  2775, 1440, 360),
    new Rectangle(2010, 2740, 385,  330)
};

// Recognize text only inside the marked-up regions
var result = api.RecognizeImage(imgPath, rectangles);

Issue_OCRNET_251.zip (2.7 MB)
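
Since the regions above are typed in by hand, a quick sanity check before running OCR can catch a mistyped coordinate. The following is only a minimal sketch: System.Drawing.Rectangle takes (x, y, width, height), and the RegionCheck class and its OutOfBounds method are hypothetical helper names, not part of Aspose.OCR. The image size has to be supplied by the caller, since the thread does not state the dimensions of image1_out.jpg.

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical helper, not part of Aspose.OCR.
public static class RegionCheck
{
    // Returns every region that does not fit entirely inside
    // an image of the given width and height (in pixels).
    public static List<Rectangle> OutOfBounds(
        IEnumerable<Rectangle> regions, int imageWidth, int imageHeight)
    {
        var bounds = new Rectangle(0, 0, imageWidth, imageHeight);

        // Rectangle.Contains(Rectangle) is true only when the
        // whole region lies within the bounds.
        return regions.Where(r => !bounds.Contains(r)).ToList();
    }
}
```

Calling RegionCheck.OutOfBounds(rectangles, width, height) with the real image dimensions should return an empty list; anything it returns is a region that extends past the edge of the image.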

jon_elster_i3intel_com · August 27, 2020, 12:18pm

Thanks for researching.
But our documents are all different. We can’t assume the regions or positions of the text. In our case we are trying to ‘pick up’ the ORDER NUMBER, which could be anywhere in these documents. So it doesn’t look like Aspose.OCR will work. We will use Tesseract for now, unfortunately, unless you have any other suggestions.
Thanks again
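
Whichever engine produces the text, picking up an order number that can sit anywhere on the page is usually a search over the recognized text rather than a matter of fixed regions. Below is a minimal sketch of that idea using only System.Text.RegularExpressions. The OrderNumberFinder class, the label variants, and the value format are all assumptions for illustration; the real pattern depends on how the documents actually print the number.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: works on plain OCR text from any engine.
public static class OrderNumberFinder
{
    // Assumed format: a label such as "ORDER NUMBER", "ORDER NO." or
    // "ORDER #", optional punctuation, then an alphanumeric value of
    // at least three characters (hyphens allowed after the first).
    private static readonly Regex Pattern = new Regex(
        @"ORDER\s*(?:NUMBER|NO\.?|#)\s*[:#\-]?\s*([A-Z0-9][A-Z0-9\-]{2,})",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the first order number found in the text, or null if none.
    public static string Find(string ocrText)
    {
        if (string.IsNullOrEmpty(ocrText))
        {
            return null;
        }

        Match match = Pattern.Match(ocrText);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

A pattern this loose will also match a word that merely follows the label (for example a column header), so in practice the value part would need tightening to the real order number format.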

asad.ali · August 27, 2020, 8:16pm

@jon_elster_i3intel_com

Thank you for your feedback.

We have noted your concerns and will investigate the ticket further accordingly. We will inform you as soon as it is resolved. Please allow us some time.

We are sorry for the inconvenience.