Creating searchable PDFs (OCR)

asad.ali · January 9, 2020, 7:00pm

You can use any C# to VB converter utility to convert this code to VB.NET.

jmika99 · September 29, 2020, 12:22pm

Using the C# code, I am able to convert to a searchable PDF, but I get a missing-font message when opening the converted PDF in Adobe. I’ve attached a sample image PDF that is representative of all image PDFs at my client site. test_10.pdf (248.2 KB)
I do not get this message when converting to searchable PDF with other tools (including Adobe). We currently use Aspose at my client site and are very interested in using this functionality. Any help would be greatly appreciated.

asad.ali · September 29, 2020, 8:02pm

@jmika99

Could you please also share the output PDF document that shows the error when opened, along with the code snippet you are using to generate it? We will test the scenario in our environment and address it accordingly.

jmika99 · September 29, 2020, 8:14pm

Output file output_10.pdf (374.9 KB)

Code:
using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

class Program
{
    static void Main(string[] args)
    {
        var doc = new Document("c:/temp/test_10.pdf");
        doc.Convert(CallBackGetHocr);
        doc.Save("C:/temp/output_10.pdf");
    }

    //********************* CallBackGetHocr method ***********************//
    static string CallBackGetHocr(System.Drawing.Image img)
    {
        string dir = @"C:\temp";
        // Path.Combine inserts the separator, so the image is saved where Tesseract reads it
        string imagePath = Path.Combine(dir, "ocrtest.jpg");
        string outputBase = Path.Combine(dir, "out");
        img.Save(imagePath);

        // Run Tesseract with the "hocr" config; the result is read from <outputBase>.hocr
        var info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        info.Arguments = $"\"{imagePath}\" \"{outputBase}\" hocr";
        using (var p = new Process())
        {
            p.StartInfo = info;
            p.Start();
            p.WaitForExit();
        }

        // Return the hOCR markup to Aspose.PDF
        return File.ReadAllText(outputBase + ".hocr");
    }
}

asad.ali · September 30, 2020, 7:13pm

@jmika99

We have logged an issue as PDFNET-48853 in our issue tracking system to investigate this error. We will look into its details and keep you posted on the status of its correction. Please be patient and give us some time.

We are sorry for the inconvenience.

BSchwab · December 9, 2020, 9:04am

@asad.ali

Why is Tesseract used and not Aspose.OCR? Do you have an example with Aspose.OCR?

asad.ali · December 9, 2020, 6:48pm

@BSchwab

We already have an investigation ticket, PDFNET-46139, logged for this purpose. We will investigate and prepare functionality that uses both Aspose.PDF and Aspose.OCR. You will receive an update in this forum thread as soon as the ticket is resolved. Please be patient and give us some time.

We are sorry for the inconvenience.

jmika99 · January 23, 2021, 11:29pm

Hello, can I get a status on issue PDFNET-48853?

-Thanks

asad.ali · January 25, 2021, 6:39pm

@jmika99

We are afraid that the earlier logged issue is not yet resolved. We will inform you as soon as we have definite news about its fix. Please give us some time.

We are sorry for the inconvenience.

betovillalobos · March 16, 2021, 4:12pm

Hello,

I am trying to create a searchable PDF using the callback sample in this thread, but the callback is not working. When I debug the original code exactly as posted, the debugger never enters the CallBackGetHocr method and does not stop at breakpoints set inside it. I copied the same code from CallBackGetHocr into a new method, “ProduceOCR”, and it works well: we get the out.hocr file. However, I don’t know how to use the resulting text to call Document.Convert without the callback invocation.

I am using Visual Studio 2019.

asad.ali · March 16, 2021, 10:53pm

@betovillalobos

If you are able to generate the .hocr file successfully, please try the code snippet below to create a searchable PDF document. If you still face any issue, please share your sample PDF and .hocr file with us:

using (var pdf = new Aspose.Pdf.Document(dataDir + @"Laga 1 bis.pdf"))
{
 pdf.Convert((image) =>
 {
  return File.ReadAllText(dataDir + @"Laga 1-p1.hocr");
 });
 pdf.Save(dataDir + "test_searchable.pdf");
}
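
Note that the snippet above returns the same .hocr file for every image passed to the callback, so it suits a single-page scan. For a multi-page document, a minimal sketch follows. It assumes the callback is invoked once per page image, in page order, and that the hOCR files are named page-1.hocr, page-2.hocr, and so on. HocrSequence and this file naming are hypothetical and not part of Aspose.PDF.

```csharp
using System.IO;

// Hypothetical helper (not an Aspose type): hands out one hOCR file per
// callback invocation, in order: page-1.hocr, page-2.hocr, ...
sealed class HocrSequence
{
    private readonly string _directory;
    private int _next; // number of files handed out so far

    public HocrSequence(string directory)
    {
        _directory = directory;
    }

    // Path of the next hOCR file in the sequence
    public string NextPath()
    {
        return Path.Combine(_directory, $"page-{++_next}.hocr");
    }

    // Contents of the next hOCR file in the sequence
    public string NextHocr()
    {
        return File.ReadAllText(NextPath());
    }
}

// Usage with the snippet above (dataDir as in that snippet):
//   var hocr = new HocrSequence(dataDir);
//   pdf.Convert(image => hocr.NextHocr());
```

If the callback order does not match the page order in your documents, the files will be paired with the wrong pages, so it is worth checking the output on a short test document first.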

betovillalobos · April 21, 2021, 7:07pm

Thank you, it worked!

BSchwab · August 4, 2021, 12:51pm

Hello, what is the current status of this topic? Is there a time horizon for the feature?

asad.ali · August 4, 2021, 5:48pm

@BSchwab

There are three different tickets linked to this thread. Could you please tell us which one you are inquiring about? We will share our feedback accordingly.

BSchwab · August 11, 2021, 6:38am

We don’t want to use tesseract.exe. It would be nice if the “create searchable PDF” feature were included in Aspose.PDF or Aspose.OCR. I guess it’s PDFNET-46139.

Other APIs have this feature (creating good OCR for PDF files without using the external tesseract.exe). We are waiting for this feature in Aspose…

asad.ali · August 11, 2021, 6:24pm

@BSchwab

We definitely intend to provide this feature. However, we are not certain when it will be available, as it is quite a complex feature and needs new components to be included in the API. We have recorded your concerns and will inform you once we make significant progress towards resolving the issue. Please give us some time.

BSchwab · September 8, 2023, 7:47am

Two years later … it’s still open?

asad.ali · September 8, 2023, 4:51pm

@BSchwab

We sincerely apologize for the delay in resolving your issue and the inconvenience it has caused you. We understand your frustration and we appreciate your patience and loyalty.

We want to assure you that your issue is important to us and we are working hard to find a solution as soon as possible. We have also escalated your issue to the next level of priority. We will surely inform you as soon as we have some definite updates about tickets’ resolution. We again apologize for the inconvenience.

BSchwab · January 26, 2024, 10:27am

@asad.ali

Hello,
I would like to ask again what the current status of the “Create searchable PDFs” feature is.

I found the following “advertisement” on the Aspose website - it sounds like the feature is already available and working: Aspose.OCR Scanned PDF to text for .NET | Aspose

I have tested the code, but the PDF that is created looks broken, although the text was recognized well. I could not get a satisfactory result with any of my (very simple) test PDFs.

For example:
Non OCR source PDF (created in Word): no_ocr_word.pdf (31.0 KB)

Aspose Result: result.pdf (121.3 KB)

asad.ali · January 26, 2024, 6:02pm

@BSchwab

This particular feature has always been challenging because of the wide variety of PDF structures. It works successfully with many PDF documents, but there is always a chance that it will not produce the expected result, because PDFs can differ in structure and in the arrangement of their elements.

Nevertheless, we also noticed the issue with Aspose.OCR for .NET in our environment and have logged a ticket as OCRNET-785 in our issue tracking system to rectify it. We will inform you once the investigation is complete and we have feedback to share. We apologize for the inconvenience caused.