How to make a PDF searchable?

How can I make an image-based PDF searchable without changing its format?

Other details:

Visual Studio 2010

.NET 3.5

Hi Vigil,


Thanks for your inquiry. I'm afraid Aspose components do not currently support creating searchable PDFs, because Aspose.Ocr is not yet mature. We are facing issues with text recognition accuracy and with the coordinates of the recognized text. Our development team is working to fix these issues and is investigating new algorithms for the purpose.

As a workaround, you can create a searchable PDF document from an image by using Aspose.Pdf together with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR. As a first step, please convert your image to PDF by following this documentation link; you can then convert it to a searchable PDF document as described below.
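For the first step (image to PDF), here is a minimal sketch. It assumes a recent Aspose.Pdf for .NET release in which `Aspose.Pdf.Image` can be added to a page's `Paragraphs` collection; older releases may expose a different API, so please check it against the documentation for your version. The file names are hypothetical placeholders.

```csharp
using Aspose.Pdf;

class ImageToPdfSketch
{
    static void Main()
    {
        // Hypothetical paths; replace them with your own files.
        string imagePath = @"c:\pdftest\scan.jpg";
        string pdfPath = @"c:\pdftest\Input.pdf";

        // Create an empty document with a single page.
        Document doc = new Document();
        Page page = doc.Pages.Add();

        // Fully qualified to avoid a clash with System.Drawing.Image.
        Aspose.Pdf.Image image = new Aspose.Pdf.Image();
        image.File = imagePath;

        // Place the image on the page and write the PDF.
        page.Paragraphs.Add(image);
        doc.Save(pdfPath);
    }
}
```

The resulting Input.pdf contains only the image, so it is not searchable yet; the OCR step described next adds the text layer.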

Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list. After that, you will have the tesseract.exe console application.

Below is a usage example:

[C#]

using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

// Callback invoked by Aspose.Pdf with a page image; returns the hOCR markup produced by Tesseract.
private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest";
    string imagePath = Path.Combine(dir, "test.jpg");
    string outputBase = Path.Combine(dir, "out");
    img.Save(imagePath);

    // Runs: tesseract <image> <output base> hocr (tesseract.exe must be on the PATH).
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = string.Format("\"{0}\" \"{1}\" hocr", imagePath, outputBase);
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Tesseract appends the extension to the output base itself (out.html here).
    StreamReader streamReader = new StreamReader(outputBase + ".html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

public void Main()
{
    Document doc = new Document("Input.pdf");
    // Adds a text layer to the image-only PDF using the hOCR returned by the callback.
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}


Please feel free to contact us for any further assistance.

Best Regards,