Make a non-searchable PDF searchable

smooney1234 · September 9, 2019, 7:44am

How do I make a non-searchable PDF searchable in .NET?

Also, how do I make sure all text is readable by OCR?

In some samples I have tried, not all of the text is readable.

asad.ali · September 9, 2019, 5:20pm

Thanks for contacting support.

You can convert a non-searchable PDF into a searchable PDF document using the following code snippet.

// Requires: using Aspose.Pdf;
private static void CreateSearchablePDF(string dataDir)
{
 Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
 // Convert hands images from the PDF to the callback and uses the returned hOCR markup.
 doc.Convert(CallBackGetHocr);
 doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
 string dir = @"E:\Data\";
 img.Save(dir + "ocrtest.jpg");
 // Run tesseract.exe on the saved image. The trailing "hocr" config asks for hOCR output;
 // Tesseract 3.03 and later write it to out.hocr (version 3.02 writes out.html instead).
 System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
 info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
 info.Arguments = "\"" + dir + "ocrtest.jpg\" \"" + dir + "out\" hocr";
 using (System.Diagnostics.Process p = System.Diagnostics.Process.Start(info))
 {
  p.WaitForExit();
 }
 // Return the hOCR markup produced by Tesseract.
 return System.IO.File.ReadAllText(dir + "out.hocr");
}

The logic above recognizes text in the images of a PDF. For recognition, you may use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR). We have used the free Google Tesseract OCR in the code snippet above. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list ; after that, you will have the tesseract.exe console application.

In case the suggested approach still does not fit your requirements, please share your sample PDF document so that we can log an investigation ticket accordingly.

smooney1234 · September 11, 2019, 10:09am

I cannot do it using this snippet. I have tried, and there is no Convert method in my version of Aspose (19.8). Plus, it is not exactly a clean way to do it.

smooney1234 · September 11, 2019, 10:34am

There is a Tesseract NuGet package you can install and use instead. Is it possible to have some more helpful examples of integrating this with Aspose in the API documentation?
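
For illustration, a callback built on the NuGet wrapper might look roughly like the sketch below. It is untested against Aspose and rests on several assumptions: the package is the `Tesseract` .NET wrapper, a `tessdata` folder containing `eng.traineddata` sits next to the executable, and the `Convert` call shown in the earlier reply is available in your Aspose.PDF build. The class name `HocrCallbacks` and the `TessDataPath` constant are hypothetical names chosen for this example.

```csharp
using System.IO;
using Tesseract;

static class HocrCallbacks
{
    // Assumption: folder holding eng.traineddata; adjust to your setup.
    private const string TessDataPath = @"./tessdata";

    // Same shape as the callback in the earlier reply: image in, hOCR markup out.
    public static string GetHocr(System.Drawing.Image img)
    {
        using (var buffer = new MemoryStream())
        {
            // Encode the image as PNG in memory, so no temporary files are needed.
            img.Save(buffer, System.Drawing.Imaging.ImageFormat.Png);

            using (var engine = new TesseractEngine(TessDataPath, "eng", EngineMode.Default))
            using (var pix = Pix.LoadFromMemory(buffer.ToArray()))
            using (var page = engine.Process(pix))
            {
                // Page number 0 because a single image is processed per call.
                return page.GetHOCRText(0);
            }
        }
    }
}
```

It would then be passed in place of the process-based callback, for example `doc.Convert(HocrCallbacks.GetHocr);`. Creating the engine on every call keeps the sketch short; reusing one engine across calls would likely be faster for documents with many images.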

asad.ali · September 11, 2019, 6:11pm

@smooney1234

Could you kindly share your sample PDF document with us? We will test the scenario in our environment and share our feedback with you accordingly.