Free Support Forum - aspose.com

Make searchable pdf

We have an aspos.pdf document that our secretary is trying to make searchable. It will not OCR. We use Nuance Converter Professional at our firm. Can you assist?

Hi Kathy,

Thanks for your interest in our APIs.

To convert a non-searchable PDF file into a searchable PDF document, you can use Aspose.Pdf together with an OCR application that supports the hOCR standard. The free Google Tesseract OCR engine can be used for this. Install Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that, you will have the tesseract.exe console application.

Below is a usage example:

[C#]

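using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public class SearchablePdfExample
{
    public static void Main()
    {
        // Load the non-searchable PDF.
        Document doc = new Document("Input.pdf");
        // Convert the document, using the callback to obtain hOCR text for each image.
        doc.Convert(CallBackGetHocr);
        doc.Save("output.pdf");
    }

    private static string CallBackGetHocr(System.Drawing.Image img)
    {
        // Working directory for the temporary files; it must already exist.
        string dir = @"c:\PdfTest\";
        img.Save(dir + "test.jpg");

        // Run tesseract (it must be on the PATH) to produce hOCR output.
        ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        info.Arguments = "\"" + dir + "test.jpg\" \"" + dir + "out\" hocr";

        using (Process p = new Process())
        {
            p.StartInfo = info;
            p.Start();
            p.WaitForExit();
        }

        // This example expects tesseract to write out.html; some versions
        // may write out.hocr instead, in which case adjust the file name.
        using (StreamReader streamReader = new StreamReader(dir + "out.html"))
        {
            return streamReader.ReadToEnd();
        }
    }
}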
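The callback above reads the Tesseract output from a fixed out.html file. As a minimal sketch, assuming that the output extension differs between Tesseract versions (.html in older releases, .hocr in newer ones), the file lookup could be made tolerant of both. The `HocrOutput` class and its `ReadHocr` method are hypothetical names used only for illustration; they are not part of Aspose.Pdf or Tesseract.

```csharp
using System;
using System.IO;

public static class HocrOutput
{
    // Hypothetical helper: returns the contents of the first hOCR file
    // found for the given output base path (the path without extension).
    public static string ReadHocr(string outputBase)
    {
        // Assumption: the extension depends on the installed Tesseract version.
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string path = outputBase + ext;
            if (File.Exists(path))
            {
                return File.ReadAllText(path);
            }
        }

        throw new FileNotFoundException(
            "No hOCR output found for base path: " + outputBase);
    }
}
```

In the callback, the StreamReader lines could then be replaced with `return HocrOutput.ReadHocr(@"c:\pdftest\out");`. A missing output file raises an exception, which makes a failed Tesseract run visible immediately.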

If you encounter any issues, please share the source PDF document so that we can look into the matter further.