How to OCR a PDF and save it as a PDF with editable text?

Hi, Support:

Does this DLL support this feature? I want to know how to OCR a PDF page and make its text editable.

Thanks for your help.

@ducaisoft

You can convert a non-searchable PDF document into a searchable PDF using Aspose.PDF. To do so, install Tesseract and use the following code snippet:

// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;
Document doc = new Document("D:/Downloads/input.pdf");
// The callback receives an image from the document and returns its hOCR text
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/pdf_searchable.pdf");
//********************* CallBackGetHocr method ***********************//
static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    // Save the image to disk so that Tesseract can read it
    img.Save(dir + "ocrtest.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: <input image> <output base> hocr (paths quoted in case they contain spaces)
    info.Arguments = "\"" + dir + "ocrtest.jpg\" \"" + dir + "out\" hocr";
    using (Process p = Process.Start(info))
    {
        p.WaitForExit();
    }
    // Some Tesseract versions write out.hocr instead of out.html; adjust the extension if needed
    return File.ReadAllText(dir + "out.html");
}

Please note that the above code snippet adds a hidden layer of text over the images in the PDF pages so that the text can be searched. However, it is not possible to make the images editable using Aspose.PDF.
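
If the callback returns nothing useful, the usual cause is that Tesseract failed or wrote its output somewhere unexpected. Below is a minimal sketch of a more defensive way to run it. The TesseractHocr class and its method names are hypothetical helpers for illustration (they are not part of Aspose.PDF or Tesseract), and the sketch assumes that the hOCR file is written next to the output base with either a .hocr or an .html extension, depending on the Tesseract version.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for illustration; not part of Aspose.PDF or Tesseract.
public static class TesseractHocr
{
    // Builds the argument string "<image> <outputBase> hocr".
    // Both paths are quoted so that paths containing spaces stay intact.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Returns the first existing hOCR file for the output base, or null if none exists.
    // Assumption: the extension is .hocr or .html depending on the Tesseract version.
    public static string FindOutput(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }

    // Runs Tesseract on an image file and returns the hOCR markup as a string.
    // Throws if Tesseract reports a failure or no output file can be found.
    public static string Run(string tesseractExe, string imagePath, string outputBase)
    {
        var info = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
            {
                throw new InvalidOperationException("Tesseract exited with code " + p.ExitCode);
            }
        }

        string output = FindOutput(outputBase);
        if (output == null)
        {
            throw new FileNotFoundException("No hOCR output found for " + outputBase);
        }
        return File.ReadAllText(output);
    }
}
```

Inside CallBackGetHocr, after saving the image, the method body could then end with a call such as return TesseractHocr.Run(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe", @"E:\Data\ocrtest.jpg", @"E:\Data\out"); so that a failed OCR run surfaces as an exception instead of an empty or stale result.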

Thank you very much for your help.
However, there is an error in my development environment. How can I fix it?
Please see the attached screenshot, where the error is underlined in red.
OCR.jpg (89.8 KB)

@ducaisoft

Please try changing the return type of the CallBackGetHocr method to String.

I tried as you suggested, but the issue persists.

123.jpg (49.7 KB)

@ducaisoft

We apologize for the inconvenience.

Please note that in VB the AddressOf keyword is needed to pass a method as a callback, as shown below:

doc.Convert(AddressOf callbackhocr)

Thanks!
That’s ok!
