Read text from image in pdf

Hello,

Please find the attached PDF, which has 'image text' inside the red box at the top of the page. I need to read the text from that red box, but that part of the PDF is an image.

Is this possible using the Aspose DLL? If so, please send me the trial DLL and sample code; if it works, we will purchase your product.

Please reply as soon as possible; this is very urgent.

Reach me @

vinoth.j@changepond.com

Thanks,

Vino.Jv

Hi Vino,


Thanks for your inquiry. You can accomplish this in two steps: first, create a searchable PDF document from the image using Aspose.Pdf together with an OCR application that supports the hOCR standard; then, extract the text from the specified region of the PDF page using Aspose.Pdf.

To convert the image to a searchable PDF document, you can use the free Google Tesseract OCR engine. First convert your image to PDF, then convert that PDF into a searchable PDF document as described below. Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that you will have the tesseract.exe console application. A usage example is shown below.

Moreover, I am sorry to inform you that the callback method used to convert a PDF to a searchable PDF document is malfunctioning in the current Aspose.Pdf version. However, the issue has been resolved in the upcoming release, 10.9.0, which will be published at the start of October 2015. We have linked your post to the issue ID (PDFNEWNET-38495), and you will be notified as soon as the release is published.

[C#]

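// Requires: using System.Diagnostics; using System.IO;
// Callback for Document.Convert: runs Tesseract on the page image and returns the hOCR markup.
private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data";
    string imagePath = Path.Combine(dir, "ocrtest.jpg");
    // Tesseract appends the output extension to this base name itself.
    string outputBase = Path.Combine(dir, "out");

    img.Save(imagePath);

    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Command line: tesseract <image> <output base> hocr
    info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";

    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Older Tesseract builds write out.html; newer ones may write out.hocr instead.
    using (StreamReader streamReader = new StreamReader(outputBase + ".html"))
    {
        return streamReader.ReadToEnd();
    }
}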

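// Requires: using Aspose.Pdf; using System.IO;
public void Main()
{
    Aspose.Pdf.License license = new Aspose.Pdf.License();
    license.SetLicense("E:/Data/AsposeLicense/asposetotal/Aspose.Total.lic");

    // Step 1: place the source image on a page of a new PDF document.
    Document doc = new Document();
    Page page = doc.Pages.Add();
    Aspose.Pdf.Image image = new Aspose.Pdf.Image();
    image.File = "E:/Data/invoice13.jpg";
    page.Paragraphs.Add(image);

    // Save to a stream and reload the document before converting it.
    MemoryStream ms = new MemoryStream();
    doc.Save(ms);
    doc = new Document(ms);

    // Step 2: run OCR through the callback to make the PDF searchable.
    doc.Convert(CallBackGetHocr);
    doc.Save("E:/Data/invoice13.jpg_output.pdf");
}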
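
Once the searchable PDF has been saved, the second step (reading the text from a specific page region, such as the red box) could look roughly like the sketch below. It uses TextAbsorber restricted to a rectangle. The rectangle coordinates are placeholders, not values measured from your file, so you would need to adjust them to the actual position of the red box; the namespace that holds the text classes may also differ between Aspose.Pdf versions.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractRegionText
{
    static void Main()
    {
        // Searchable PDF produced by the conversion code above.
        Document doc = new Document("E:/Data/invoice13.jpg_output.pdf");

        TextAbsorber absorber = new TextAbsorber();

        // Restrict extraction to the rectangle set below.
        absorber.TextSearchOptions.LimitToPageBounds = true;

        // Placeholder region (llx, lly, urx, ury) in points, measured from the
        // bottom-left corner of the page. Replace with the red box position.
        absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(50, 700, 550, 800);

        // Page indexes are 1-based in Aspose.Pdf.
        doc.Pages[1].Accept(absorber);

        Console.WriteLine(absorber.Text);
    }
}
```

If the extracted text comes back empty, it is worth first calling Accept without the rectangle to confirm that the OCR layer was actually added to the page.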


Please feel free to contact us for any further assistance.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-38495) have been fixed in Aspose.Pdf for .NET 10.9.0.


This message was posted using Notification2Forum from Downloads module by Aspose Notifier.