How can I convert a scanned PDF to an editable PDF in Aspose.PDF?
You can convert a non-searchable PDF into a searchable PDF document by using the following code snippet with Aspose.PDF for .NET.
using System.IO;
using Aspose.Pdf;

private static void CreateSearchablePDF(string dataDir)
{
    // Note: dataDir is not used here; the paths below are hard-coded for illustration.
    Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
    // CallBackGetHocr receives an image from the PDF and must return its hOCR text.
    doc.Convert(CallBackGetHocr);
    doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    // Save the image to disk so that Tesseract can read it.
    string dir = @"E:\Data\";
    img.Save(dir + "ocrtest.jpg");

    // Tesseract V3.02: the "hocr" argument makes it write E:\data\out.html.
    System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";
    System.Diagnostics.Process p = new System.Diagnostics.Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Read the hOCR result and return it to Aspose.PDF.
    using (StreamReader streamReader = new StreamReader(@"E:\data\out.html"))
    {
        return streamReader.ReadToEnd();
    }
}
The above logic recognizes text in the images of the PDF. For recognition, you may use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR). We have used the free Google Tesseract OCR engine in the above code snippet. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that, you will have the tesseract.exe console application.
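One thing to watch: the snippet reads E:\data\out.html, which matches the Tesseract V3.02 noted in the code comment. As far as I know, newer Tesseract releases name the hOCR output "<base>.hocr" instead, so the hard-coded file name may not exist after an upgrade. Below is a minimal sketch of a lookup helper under that assumption. HocrOutput and FindHocrFile are hypothetical names of mine, not part of Aspose.PDF or Tesseract.

```csharp
using System;
using System.IO;

static class HocrOutput
{
    // Hypothetical helper: returns the path of the hOCR file that Tesseract
    // wrote for the given output base path (e.g. @"E:\data\out").
    public static string FindHocrFile(string outputBase)
    {
        // Assumption: newer Tesseract releases write "<base>.hocr",
        // while V3.02 writes "<base>.html". Prefer the newer name.
        foreach (string extension in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + extension;
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }

        // Neither file exists, so the OCR step most likely failed.
        throw new FileNotFoundException(
            "No hOCR output found for base path: " + outputBase);
    }
}
```

In the callback you could then return File.ReadAllText(HocrOutput.FindHocrFile(@"E:\data\out")) instead of opening out.html directly. If an old out.html is left over from an earlier run, delete it before starting Tesseract so that a stale result is not picked up.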