Document.Convert not processing a well formatted hocr object to make searchable pdf

ILSTech · September 4, 2018, 6:11pm

Dear Support,

I am trying to use Aspose.PDF (version 18.8.0) to create searchable PDFs. I have followed your cookbook from other forum posts that use Tesseract. For many pages the process works flawlessly; however, several pages do not. When I click on such a page in Acrobat Reader, I get an error message stating:

Cannot find or create the font ‘NIRQDO+TimesNewRoman’. Some characters may not display or print correctly.

If I try to highlight any text on the page, it selects only images, not text. The hOCR object created by Tesseract looks valid, with all of the text intact. I am attaching a sample PDF file (test.pdf), a console program source file (Program.cs), and the hOCR object created by Tesseract.

Test.pdf (126.1 KB)
Test.zip (124.0 KB)
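
One rough way to back up the observation that the hOCR output still carries its text is to count the word elements in it. The sketch below assumes the usual hOCR convention of marking recognized words with the class ocrx_word; HocrCheck and CountWords are hypothetical names, and a regular expression is only a quick sanity check, not a real parser.

```csharp
using System;
using System.Text.RegularExpressions;

static class HocrCheck
{
    // Counts elements whose class attribute contains "ocrx_word",
    // the class name hOCR uses for individual recognized words.
    // Assumption: attribute values are quoted with ' or ".
    public static int CountWords(string hocr)
    {
        return Regex.Matches(hocr, "class\\s*=\\s*['\"][^'\"]*\\bocrx_word\\b").Count;
    }
}
```

A count of zero for a page that visibly contains text would point at the OCR step; a plausible count would suggest the problem lies later, in the conversion.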

asad.ali · September 5, 2018, 12:10am

@ILSTech

Thanks for contacting support.

We have tested the scenario in our environment using the third-party tesseract-ocr and Aspose.PDF for .NET 18.8. We noticed that the output PDF was not correctly searchable. However, we could not observe the issues you mentioned.

Would you please share which tesseract-ocr you are using, along with a complete sample console application, so that we can test the scenario again in our environment and address it accordingly?

ILSTech · September 5, 2018, 2:34pm

Dear Support,

The version of Tesseract that I installed was found on this page:

https://github.com/UB-Mannheim/tesseract/wiki

I realize now that I had installed a beta version of Tesseract 4.0. So, I uninstalled this and used the 3.05 installer from this link:

https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.02-20180621.exe

With this version, I noticed the OCR is better, but there is still the font anomaly and some unusual artifacts after the OCR process.

I am attaching a zip file with my VS solution. Inside is the complete document file that the page came from. If you run this program, you will notice that pages 20 and 21 OCR successfully. No other pages require OCR until pages 64 through 82. If you go to page 64 in the output PDF file, you will get the font message I referred to. On subsequent pages, strange rectangles show up when you press Ctrl+A. On pages 67 and 72 these rectangles are very large, and you cannot click on the text inside them. When you click on the rectangle, you get a very large blinking cursor.

Thank you in advance for your support.

MakeSearchablePDF.zip (5.8 MB)

ILSTech · September 5, 2018, 6:18pm

Here is a much smaller PDF file that gets the same missing-font message when using the same sample program. In this case, there are no Bates stamps on the page.

0000001.pdf (200.0 KB)

asad.ali · September 5, 2018, 10:16pm

@ILSTech

Thanks for sharing sample application.

We tried to execute your application after modifying the respective directories, but the program threw a FileNotFound exception at the following line of code:

StreamReader streamReader = new StreamReader(@"E:\Data\out.hocr");

The complete code snippet of the method, with the modified directory, is as follows:

string dir = @"E:\Data\";
img.Save(dir + "ocrtest.png");
ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
info.WindowStyle = ProcessWindowStyle.Hidden;
info.Arguments = @"E:\Data\ocrtest.png E:\Data\out hocr";
Process p = new Process();
p.StartInfo = info;
p.Start();
p.WaitForExit();
StreamReader streamReader = new StreamReader(@"E:\Data\out.hocr");
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;

FileNotFound.png (10.6 KB)

Would you please share a sample application that can replicate the issue without throwing any error? In case we have missed something while running your sample application, please let us know.
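
When the output file is missing after tesseract.exe returns, it can help to see why the process failed rather than only the later FileNotFound error. The following is a minimal sketch, not the code from the sample application: TesseractRunner, BuildArguments and RunHocr are hypothetical names, and it assumes the hocr config writes a file with the .hocr extension, as observed later in this thread. It quotes both paths, captures standard error, and checks the exit code before reading the result.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractRunner
{
    // Quote both paths so that directories containing spaces survive argument parsing.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Runs tesseract and returns the hOCR text, or throws with the captured diagnostics.
    public static string RunHocr(string exePath, string imagePath, string outputBase)
    {
        var info = new ProcessStartInfo(exePath)
        {
            Arguments = BuildArguments(imagePath, outputBase),
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardError = true,   // tesseract reports problems on stderr
            CreateNoWindow = true
        };
        using (Process p = Process.Start(info))
        {
            // Read stderr before waiting so a full pipe cannot block the child process.
            string stderr = p.StandardError.ReadToEnd();
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException(
                    "tesseract exited with code " + p.ExitCode + ": " + stderr);
        }
        // Assumption: the hocr config produces <outputBase>.hocr.
        string hocrPath = outputBase + ".hocr";
        if (!File.Exists(hocrPath))
            throw new FileNotFoundException("tesseract did not write the expected file", hocrPath);
        return File.ReadAllText(hocrPath);
    }
}
```

With the paths from the snippet above, the call would look like TesseractRunner.RunHocr(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe", @"E:\Data\ocrtest.png", @"E:\Data\out").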

ILSTech · September 5, 2018, 11:26pm

Dear Support,

This code is taken from a post on your site where you were instructing someone in how to do this. It works on my PC. See post https://forum.aspose.com/t/creating-searchable-pdfs-ocr/172824

The one thing I notice is that in the above-mentioned post, the file you were opening is @"E:\Data\out.html". I adjusted this because I saw that Tesseract was creating the file as out.hocr. Maybe either the version of Tesseract or its configuration options generate the filename differently. Can you set a breakpoint right before this, see what Tesseract is creating, and adjust the extension accordingly?

Thanks In Advance
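
Since the two posts disagree on the extension (out.html versus out.hocr), one way to avoid hard-coding it is to probe for both after Tesseract has run. This is only a sketch under that assumption; HocrOutput and FindOutputFile are hypothetical names, and other installations could use yet another naming scheme.

```csharp
using System;
using System.IO;

static class HocrOutput
{
    // Returns the first existing candidate for the given output base name,
    // or null if Tesseract produced neither file.
    public static string FindOutputFile(string outputBase)
    {
        // Assumption: the hOCR output is written as either .hocr or .html.
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate))
                return candidate;
        }
        return null;
    }
}
```

A null result from HocrOutput.FindOutputFile(@"E:\Data\out") would indicate that Tesseract itself did not run or did not write any output, rather than an extension mismatch.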

asad.ali · September 6, 2018, 9:41am

@ILSTech

It seems that there is a difference between your environment and ours, which is why tesseract is unable to execute and generate output in our environment. Since you were able to execute the code snippet successfully in your environment, would you please share details about it, e.g., OS name, version, x64/x86, etc.? We will try to test the scenario in that specific environment and address it accordingly.

ILSTech · September 6, 2018, 3:46pm

My workstation is running Windows 10 Home (64 bit). I am building the program in Visual Studio 2017. I installed Tesseract from this link: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.02-20180621.exe

asad.ali · September 6, 2018, 10:52pm

@ILSTech

Thanks for sharing more details.

Would you please share the output PDF document that was generated on your side after performing HOCR on this PDF?

ILSTech · September 20, 2018, 2:36pm

I am sorry it has taken a while to get back. Here is the output of converting the file “0000001.pdf” that I shared with you above.

test2.pdf (420.1 KB)

asad.ali · September 20, 2018, 8:19pm

@ILSTech

Thanks for getting back to us.

We have logged an investigation ticket, PDFNET-45418, in our issue tracking system for this scenario. Please note that we were still unable to produce similar output in our environment, and tesseract-related commands would not execute. However, we will investigate the scenario further in light of the details you provided. We will keep you posted on the investigation's progress within this thread. Please allow us some time.

We are sorry for the inconvenience.