Making PDF searchable

Hi,

I have an image-only PDF file. Can I make that PDF searchable (image above text) using Aspose OCR products?

Regards,

Shama

@shamatungare

You can perform OCR on a PDF document by following the code snippet given in the following article of the API documentation:

Hi,

Thank you for your response. I had already seen this example. It does not make the PDF searchable; it only extracts the text from it (Console.WriteLine(ocrEngine.Text)). Please share an example where the PDF is made searchable.

Regards,

Shama

@shamatungare

Regretfully, Aspose.OCR does not provide functionality to create searchable PDF documents. However, you can convert a non-searchable PDF into a searchable PDF document by using the following code snippet with Aspose.PDF for .NET.

using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

private static void CreateSearchablePDF(string dataDir)
{
    // Convert() hands the PDF's images to the callback and uses the returned hOCR
    // to make the document searchable.
    Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    // Save the image so the external OCR engine can read it.
    string dir = @"E:\Data\";
    string imagePath = dir + "ocrtest.jpg";
    string outputBase = dir + "out";
    img.Save(imagePath);

    // Run Tesseract (v3.02) with the "hocr" config; it writes <outputBase>.html.
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    using (Process p = Process.Start(info))
    {
        p.WaitForExit();
    }

    // Return the hOCR markup produced by Tesseract.
    return File.ReadAllText(outputBase + ".html");
}

The logic above recognizes text in the PDF's images. For recognition, you can use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR). We used the free Google Tesseract OCR in the code snippet above. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that you will have the tesseract.exe console application.
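One thing that may trip people up: the snippet reads out.html, which matches Tesseract 3.02. As far as I know, later Tesseract releases write the hOCR result with a .hocr extension instead, so the read can fail with a newer install. Below is a minimal sketch of a helper that checks both extensions. The class name HocrOutput and the method Read are hypothetical names used for illustration, not part of Aspose.PDF or Tesseract.

```csharp
using System;
using System.IO;

static class HocrOutput
{
    // Hypothetical helper: returns the hOCR text that Tesseract wrote for the
    // given output base path (e.g. @"E:\Data\out"), trying ".hocr" first
    // (assumed for newer Tesseract versions) and then ".html" (Tesseract 3.02).
    public static string Read(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string path = outputBase + ext;
            if (File.Exists(path))
            {
                return File.ReadAllText(path);
            }
        }
        throw new FileNotFoundException("No hOCR output found for base path: " + outputBase);
    }
}
```

Inside the callback, the final read could then become return HocrOutput.Read(@"E:\Data\out");. Note that a stale out.html from an earlier run is ignored whenever out.hocr exists, because .hocr is checked first.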

Asad, does this mean that Aspose's online APIs do not use Aspose.OCR exclusively, but in combination with other third-party software?

@lion.brotzky

We are collecting information about the Aspose.OCR online app and will get back to you shortly.

@lion.brotzky

  1. Our Web API (Free Online App) uses the Cloud SDK, so to get the same results and the same functionality you must use the OCR Cloud SDK (Cloud OCR API for on-premise and online solutions).
  2. The Cloud SDK uses a model that allows you to recognize tables and receipts. We plan to include this model in the downloadable version, but we are not sure we can do it any time soon.