Free Support Forum - aspose.com

Making PDF searchable

Hi,

I have an image-only PDF file. Can I make that PDF searchable (image above text) using Aspose OCR products?

Regards,

Shama

@shamatungare

You can perform OCR on a PDF document by following the code snippet in the following API documentation article:

Hi,

Thank you for your response. I had already seen this example. It does not make the PDF searchable; it just extracts the text from it (Console.WriteLine(ocrEngine.Text)). Please share an example where the PDF is made searchable.

Regards,

Shama

@shamatungare

Regretfully, Aspose.OCR does not provide functionality to create searchable PDF documents. However, you can convert a non-searchable PDF into a searchable PDF document by using the following code snippet with Aspose.PDF for .NET.

using System.IO;
using Aspose.Pdf;

private static void CreateSearchablePDF(string dataDir)
{
 // Note: dataDir is unused; the paths below are hard-coded examples, so adjust them for your machine.
 Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
 // Convert invokes the callback for the images in the PDF and uses the returned hOCR text.
 doc.Convert(CallBackGetHocr);
 doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
 string dir = @"E:\Data\";
 // Save the image to disk so that tesseract.exe can read it.
 img.Save(dir + "ocrtest.jpg");
 // Written for Tesseract 3.02, which writes its hOCR output to <outputbase>.html.
 System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
 info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
 info.Arguments = @"E:\Data\ocrtest.jpg E:\Data\out hocr";
 using (System.Diagnostics.Process p = System.Diagnostics.Process.Start(info))
 {
  p.WaitForExit();
 }
 // Return the hOCR markup that Tesseract produced.
 return File.ReadAllText(dir + "out.html");
}

The code above recognizes text in the images of the PDF. For recognition, you can use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR ). The snippet above uses the free Google Tesseract OCR. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list ; after that you will have the tesseract.exe console application.
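
If the hard-coded E:\Data paths are a problem, the Tesseract call can be pulled into a small helper. The following is only a sketch under some assumptions: TesseractHocr and its methods are hypothetical names (not part of Aspose.PDF or Tesseract), the temp folder is used for output, and, as far as I know, Tesseract 3.02 writes <outputbase>.html while later releases write <outputbase>.hocr, so the helper looks for both.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for illustration; not part of Aspose.PDF or Tesseract.
public static class TesseractHocr
{
    // Builds "<image> <outputbase> hocr", quoting both paths so spaces are safe.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Assumption: 3.02 writes <outputbase>.html, later releases write <outputbase>.hocr.
    // Returns the first of those files that exists, or null if neither is present.
    public static string FindHocrFile(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate)) return candidate;
        }
        return null;
    }

    // Runs tesseract.exe on imagePath and returns the hOCR markup it produced.
    public static string Run(string tesseractExe, string imagePath)
    {
        // Unique output base in the temp folder, so parallel runs do not collide.
        string outputBase = Path.Combine(Path.GetTempPath(), "hocr_" + Guid.NewGuid().ToString("N"));
        var info = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }
        string hocrFile = FindHocrFile(outputBase);
        if (hocrFile == null)
            throw new FileNotFoundException("No hOCR output found for " + outputBase);
        try { return File.ReadAllText(hocrFile); }
        finally { File.Delete(hocrFile); }
    }
}
```

Inside CallBackGetHocr you would then save img to a temporary file and return TesseractHocr.Run(pathToTesseractExe, tempImagePath), where both arguments are values you supply.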