Make a non-searchable PDF searchable


#1

How do I make a non-searchable PDF searchable in .NET?

Also, how do I make sure all text is readable by OCR?

In some samples I have tried, not all of the text is readable.


#2

@smooney1234

Thanks for contacting support.

You can convert a non-searchable PDF into a searchable PDF document using the following code snippet.

// Requires: using System.IO; using Aspose.Pdf;
private static void CreateSearchablePDF(string dataDir)
{
 // Load the image-only PDF, add a text layer from the hOCR callback, then save.
 Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
 doc.Convert(CallBackGetHocr);
 doc.Save("E:/Data/test_searchable.pdf");
}

// Called for the images in the PDF; must return the hOCR markup for the given image.
static string CallBackGetHocr(System.Drawing.Image img)
{
 string dir = @"E:\Data\";
 // Write the image to disk so the external OCR process can read it.
 img.Save(dir + "ocrtest.jpg");
 // Tesseract V3.02: the "hocr" config writes the result to <output base>.html
 System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
 info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
 info.Arguments = "\"" + dir + "ocrtest.jpg\" \"" + dir + "out\" hocr";
 // Run tesseract.exe and wait until it has finished writing the output file.
 using (System.Diagnostics.Process p = System.Diagnostics.Process.Start(info))
 {
  p.WaitForExit();
 }
 // Return the hOCR markup produced by Tesseract.
 return File.ReadAllText(dir + "out.html");
}

The logic above recognizes text in the images of the PDF. For recognition, you may use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR ). We used the free Google Tesseract OCR in the code snippet above. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list ; after that you will have the tesseract.exe console application.

In case the suggested approach still does not fit your requirement, please share your sample PDF document so that we can log an investigation ticket accordingly.


#3

I cannot do it using this snippet. I have tried, and there is no Convert method in my version of Aspose (19.8). Plus, it is not exactly a clean way to do it.


#4

There is a Tesseract NuGet package you can install and use instead. Would it be possible to have some more helpful examples of integrating this with Aspose in the API documentation?
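
For illustration, a minimal sketch of what such an integration might look like. It assumes the Tesseract NuGet package (the charlesw .NET wrapper), a tessdata folder containing eng.traineddata, and that the Document.Convert(CallBackGetHocr) overload from the reply above is available in the installed Aspose.PDF version. The class name, the tessdata path and the command-line arguments are hypothetical. The callback does the OCR in-process and returns the hOCR markup directly, so no temporary files or tesseract.exe are needed.

```csharp
using System.IO;
using Aspose.Pdf;
using Tesseract;

static class SearchablePdfSketch
{
    // Hypothetical location of the Tesseract language data (eng.traineddata).
    const string TessDataPath = @"./tessdata";

    // Matches the callback shape used in the reply above: image in, hOCR markup out.
    static string GetHocrInProcess(System.Drawing.Image img)
    {
        using (var buffer = new MemoryStream())
        {
            // Encode the image as PNG in memory instead of writing it to disk.
            img.Save(buffer, System.Drawing.Imaging.ImageFormat.Png);

            // Creating the engine per call keeps the sketch simple;
            // reusing one engine across calls would avoid repeated start-up cost.
            using (var engine = new TesseractEngine(TessDataPath, "eng", EngineMode.Default))
            using (var pix = Pix.LoadFromMemory(buffer.ToArray()))
            using (var page = engine.Process(pix))
            {
                // hOCR output for this single image (page number 0).
                return page.GetHOCRText(0);
            }
        }
    }

    static void Main(string[] args)
    {
        // args[0]: input PDF path, args[1]: output PDF path (both supplied by the caller).
        var doc = new Document(args[0]);
        doc.Convert(GetHocrInProcess);
        doc.Save(args[1]);
    }
}
```

The names System.Drawing.Imaging.ImageFormat and System.Drawing.Image are written out in full because the imported namespaces may define types with the same short names.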


#5

@smooney1234

Could you kindly share your sample PDF document with us? We will test the scenario in our environment and share our feedback with you accordingly.