Make searchable PDF

Liberty14 · February 23, 2016, 7:35am

We have an aspos.pdf document that our secretary is trying to make searchable. It will not OCR. We use Nuance Converter Professional at our firm. Can you assist?

codewarior · February 24, 2016, 1:30pm

Hi Kathy,

Thanks for your interest in our APIs.

To convert a non-searchable PDF file into a searchable PDF document, you can use Aspose.Pdf together with an OCR application that supports the hOCR standard. The free Google Tesseract OCR engine can be used for this. Install Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that you will have the tesseract.exe console application.

Below is a usage example:

[C#]

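using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public void Main()
{
    // Load the non-searchable (image-only) PDF
    Document doc = new Document("Input.pdf");
    // Aspose.Pdf passes images to the callback and uses the hOCR text it returns
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";
    // Save the image to disk so that tesseract can read it
    img.Save(dir + "test.jpg");

    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // The trailing "hocr" asks tesseract for hOCR output, written to <dir>out.html
    info.Arguments = dir + "test.jpg " + dir + "out hocr";

    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Read the hOCR result and hand it back to Aspose.Pdf
    using (StreamReader streamReader = new StreamReader(dir + "out.html"))
    {
        return streamReader.ReadToEnd();
    }
}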
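If the callback is called more than once, or the working folder contains spaces, the fixed test.jpg / out.html names can get in the way. The following is only a rough sketch of how the tesseract call could be isolated in a helper. HocrRunner and its method names are hypothetical and not part of Aspose.Pdf or Tesseract. It assumes tesseract is on the PATH and, because the output extension can differ between Tesseract versions, it checks for both out.hocr and out.html.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper (not part of Aspose.Pdf or Tesseract): runs the tesseract
// command line on one image file and returns the hOCR text it produces.
public static class HocrRunner
{
    // Builds the argument string "<image> <outputBase> hocr", quoting both
    // paths so that directories containing spaces still work.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Returns the first existing output file, or null if there is none.
    // Both candidate extensions are checked (assumption: version dependent).
    public static string FindOutput(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            if (File.Exists(outputBase + ext)) return outputBase + ext;
        }
        return null;
    }

    public static string Run(string imagePath, string tesseractExe = "tesseract")
    {
        // Unique base name per call, so repeated calls do not overwrite each other.
        string outputBase = Path.Combine(Path.GetTempPath(), "hocr_" + Guid.NewGuid().ToString("N"));

        var info = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }

        string outputFile = FindOutput(outputBase);
        if (outputFile == null)
            throw new FileNotFoundException("tesseract produced no hOCR file for " + imagePath);

        // Read the hOCR text, then remove the temporary output file.
        try { return File.ReadAllText(outputFile); }
        finally { File.Delete(outputFile); }
    }
}
```

Inside CallBackGetHocr you would then save img to a temporary file and return HocrRunner.Run(thatFilePath) instead of starting the process inline.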

If you encounter any issue, please share the source PDF document so that we can look into the matter further.