Read text from image in pdf

jvinoth · September 18, 2015, 6:04am

Hello,

Please find the attached PDF, which has ‘image text’ within the red box at the top. I need to read the text from that red box, but that part of the PDF is an image.

Is this possible using the Aspose DLL? If so, please send me the trial DLL and sample code. If it works, we will purchase your product.

Please reply as soon as possible. This is very urgent.

Contact

Reach me @ vinoth.j@changepond.com

Thanks,
Vino.Jv

tilal.ahmad · September 21, 2015, 12:52am

Hi Vino,

Thanks for your inquiry. You may accomplish your requirements in two steps:

1. Create a searchable PDF document from the image using Aspose.Pdf together with an OCR application that supports the hOCR standard.
2. Extract text from a specified region of the PDF page using Aspose.Pdf.

To convert an image to a searchable PDF document, you can use the free Google Tesseract OCR engine. First convert your image to a PDF, then convert that PDF into a searchable PDF document as shown below. Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; this gives you the tesseract.exe console application. A usage example follows.

Moreover, I am sorry to inform you that the callback method used to convert a PDF to a searchable PDF document is malfunctioning in the current Aspose.Pdf version. However, the issue has been resolved in the upcoming release, 10.9.0, which will be published at the start of October 2015. We have linked your post to the issue ID (PDFNEWNET-38495), and you will be notified as soon as the release is published.

[C#]

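using System.Diagnostics;
using System.IO;

// Callback passed to Document.Convert: saves the image to disk,
// runs Tesseract on it, and returns the resulting hOCR markup.
private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    string imagePath = dir + "ocrtest.jpg";
    img.Save(imagePath);

    // Run Tesseract with the "hocr" config; it appends the file extension
    // to the output base name (E:\Data\out) itself.
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = imagePath + " " + dir + "out hocr";

    using (Process p = Process.Start(info))
    {
        p.WaitForExit();
    }

    // Older Tesseract builds (such as 3.02) write out.html; newer ones may
    // write out.hocr instead, so adjust the file name if needed.
    return File.ReadAllText(dir + "out.html");
}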

public void Main()
{
    Aspose.Pdf.License license = new Aspose.Pdf.License();
    license.SetLicense("E:/Data/AsposeLicense/asposetotal/Aspose.Total.lic");

    // Step 1: place the source image on a page of a new PDF document.
    Document doc = new Document();
    Page page = doc.Pages.Add();
    Aspose.Pdf.Image image = new Aspose.Pdf.Image();
    image.File = "E:/Data/invoice13.jpg";
    page.Paragraphs.Add(image);

    // Step 2: reload the image-only PDF from memory and add the text layer
    // by running the hOCR callback over it.
    MemoryStream ms = new MemoryStream();
    doc.Save(ms);
    doc = new Document(ms);
    doc.Convert(CallBackGetHocr);
    doc.Save("E:/Data/invoice13.jpg_output.pdf");
}
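
For the second step, once the searchable PDF has been created, the text inside a page region can be read with a TextAbsorber restricted to a rectangle. The following is only a minimal sketch: it assumes the output file produced by the code above, and the rectangle coordinates are placeholders (in points, measured from the bottom-left corner of the page) that you would need to adjust to match the red box in your own document.

[C#]

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class RegionTextExample
{
    public static void Main()
    {
        // Open the searchable PDF created in the previous step.
        Document doc = new Document("E:/Data/invoice13.jpg_output.pdf");

        // Limit extraction to a rectangle on the page.
        // The values (llx, lly, urx, ury) are hypothetical; replace them
        // with the coordinates of the red box.
        TextAbsorber absorber = new TextAbsorber();
        absorber.TextSearchOptions.LimitToPageBounds = true;
        absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(50, 700, 550, 800);

        // Pages are 1-based; run the absorber on the first page only.
        doc.Pages[1].Accept(absorber);

        Console.WriteLine(absorber.Text);
    }
}

The quality of the extracted text depends on the OCR result, so it is worth checking the output against the original image before relying on it.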

Please feel free to contact us for any further assistance.

Best Regards,

aspose.notifier · October 2, 2015, 6:32am

The issues you have found earlier (filed as PDFNEWNET-38495) have been fixed in Aspose.Pdf for .NET 10.9.0.
