Convert scanned PDF to editable PDF in Aspose.PDF

koreaus1 · February 11, 2023, 10:58am

How can I convert a scanned PDF to an editable PDF in Aspose.PDF?

asad.ali · February 11, 2023, 1:08pm

You can convert a non-searchable PDF into a searchable PDF document by using the following code snippet with Aspose.PDF for .NET.

// Namespaces the snippet relies on.
using System.IO;
using Aspose.Pdf;

// Note: dataDir is not used below; the paths are hard-coded for this example.
private static void CreateSearchablePDF(string dataDir)
{
    // Load the scanned (image-only) PDF.
    Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
    // Run the OCR callback on the page images so the text becomes searchable.
    doc.Convert(CallBackGetHocr);
    doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    // Save the page image to disk so that Tesseract can read it.
    img.Save(dir + "ocrtest.jpg");
    // V3.02
    System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    // Arguments: input image, output base name, and the "hocr" config (writes out.html).
    info.Arguments = dir + "ocrtest.jpg " + dir + "out hocr";
    System.Diagnostics.Process p = new System.Diagnostics.Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Hand the hOCR markup back to Aspose.PDF.
    return File.ReadAllText(dir + "out.html");
}

The logic above recognizes text in the images of a PDF. For recognition, you may use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR ). The code snippet above uses the free Google Tesseract OCR. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list ; after that you will have the tesseract.exe console application.
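
If you want the Tesseract call to be a little more defensive, something along the following lines may help. This is only a sketch, not part of the Aspose.PDF API: the class name HocrRunner and both of its methods are made up for illustration. It assumes that tesseract.exe accepts the arguments "input image, output base name, hocr" as in the snippet above, and that the output file is named either out.html (as with V3.02) or out.hocr (which, as far as I know, is what later Tesseract releases write). It also checks the exit code and writes to the temp folder instead of a fixed directory.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper class, shown for illustration only.
public static class HocrRunner
{
    // Returns the hOCR file written for the given output base name.
    // Both extensions are checked because the name is assumed to
    // differ between Tesseract versions (.html vs .hocr).
    public static string FindHocrOutput(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        throw new FileNotFoundException("No hOCR output found for " + outputBase);
    }

    // Runs tesseract.exe on one image file and returns the hOCR markup.
    public static string RunTesseract(string tesseractExe, string imagePath)
    {
        // Unique output base in the temp folder, so parallel runs do not clash.
        string outputBase = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));

        var info = new ProcessStartInfo(tesseractExe)
        {
            // Quote both paths in case they contain spaces.
            Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr",
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
            {
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
            }
        }

        string hocrPath = FindHocrOutput(outputBase);
        try
        {
            return File.ReadAllText(hocrPath);
        }
        finally
        {
            // Remove the intermediate hOCR file once it has been read.
            File.Delete(hocrPath);
        }
    }
}
```

Inside the callback you would then save img to a temporary file and return HocrRunner.RunTesseract(pathToTesseractExe, pathToSavedImage), where both arguments are paths you supply yourself.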