OCR and PDF

Parkoos · April 21, 2014, 2:15pm

I have a requirement to create a searchable PDF from a TIF image.

Is it possible to create a text file with the OCR engine and then add it as a layer to the PDF of the image? The PDF would be created with Aspose.Pdf, of course.

Thanks in advance for your help.

Parkoos

babar.raza · April 22, 2014, 2:51am

Hi Parkoos,

Thank you for using Aspose products.

Yes, you can use the Aspose.OCR for .NET API to extract text from a TIFF image and insert the extracted text into a PDF file using the Aspose.Pdf for .NET API. However, we are not certain about your requirement of “adding a text layer on the PDF of the image”. Please elaborate on this part so that we can better assist you.

Regarding Aspose.OCR for .NET, please note that the API currently supports the Verdana, Times New Roman, Courier New, Tahoma, Calibri and Arial fonts in normal, italic and bold styles, and the supported languages are English, French, Spanish and Cyrillic. Please download the latest version, Aspose.OCR for .NET 1.9.0, and its corresponding resource file to try the API on your end.

Parkoos · April 22, 2014, 12:11pm

Hi Babar,

What I need to accomplish is to create a searchable PDF from a TIFF image.

From what I read about PDF, this is accomplished by adding a text layer (composed of OCR of the Image) over the Image layer in the PDF.

Hope that helps.

Thanks

Prakash

codewarior · April 22, 2014, 11:43pm

Hi Prakash,

Thanks for sharing the details.

I would like to share some details from the Aspose.Pdf for .NET perspective. Please note that when creating a searchable PDF file, you do not need to place a text layer on top of the image/TIFF yourself. Instead, as per your requirement, you may first place the TIFF image inside the PDF file and then perform OCR to create a searchable PDF document.

Please visit the following link for the required information on How to Convert an Image to PDF. To convert the image PDF to a searchable PDF file, please note that the following logic recognizes text from the images in the PDF. For recognition, you may use an external OCR engine that supports the hOCR standard (hOCR - Wikipedia).

I have used the free Google Tesseract OCR (http://en.wikipedia.org/wiki/Tesseract_(software)). Please install it on your computer from tesseract-ocr · GitHub; after that you will have the tesseract.exe console application.

Below you can see a usage example:

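using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

// Callback invoked by Aspose.Pdf for each image it needs recognized;
// it must return the hOCR markup for that image.
private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest";
    // Path.Combine inserts the separator, so the file lands in c:\PdfTest\test.jpg
    string imagePath = Path.Combine(dir, "test.jpg");
    img.Save(imagePath);
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Usage: tesseract <input image> <output base name> hocr
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Note: newer tesseract builds may write out.hocr instead of out.html
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

public void Main()
{
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}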
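For anyone adapting the callback, here is a minimal sketch of a somewhat more defensive tesseract wrapper. It is not part of the Aspose API: the class name TesseractHocr and its methods are hypothetical helpers, and the sketch assumes tesseract is on the PATH and that the hOCR output is written as either out.hocr or out.html depending on the installed version. It works in a per-call temporary folder rather than a fixed directory and checks the exit code before reading the result.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of Aspose.Pdf or tesseract itself.
public static class TesseractHocr
{
    // Returns the hOCR file found for the given output base name,
    // trying ".hocr" first and then ".html"; null if neither exists.
    public static string FindHocrOutput(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate)) return candidate;
        }
        return null;
    }

    // Runs "tesseract <image> <outputBase> hocr" and returns the hOCR markup.
    // Assumes the tesseract executable can be found on the PATH.
    public static string Run(string imagePath)
    {
        string workDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(workDir);
        try
        {
            string outputBase = Path.Combine(workDir, "out");
            ProcessStartInfo info = new ProcessStartInfo("tesseract");
            // Quote both paths so folders containing spaces still work.
            info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
            info.UseShellExecute = false;
            info.CreateNoWindow = true;
            using (Process p = Process.Start(info))
            {
                p.WaitForExit();
                if (p.ExitCode != 0)
                    throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
            }
            string hocrPath = FindHocrOutput(outputBase);
            if (hocrPath == null)
                throw new FileNotFoundException("tesseract produced no hOCR output", outputBase);
            return File.ReadAllText(hocrPath);
        }
        finally
        {
            // Remove the temporary folder and everything in it.
            Directory.Delete(workDir, true);
        }
    }
}
```

Inside CallBackGetHocr you would then save img to a temporary file and return TesseractHocr.Run(thatPath). Throwing on a non-zero exit code or a missing output file makes a failed recognition show up at the tesseract step rather than later inside doc.Convert.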

kamineni_srinivasara · October 27, 2015, 7:53am

Hi Experts,

I have a similar kind of problem where I need to OCR all my PDF documents and extract an image from the PDF file.

I tried using tesseract as in the above example, but I am getting a "not an image file" error at the line doc.Convert(CallBackGetHocr);

Please suggest whether there is anything I can do to OCR all the PDF files.

Thanks in advance for the help.

Regards,

K.Srinivasarao.

muhammad.ijaz · October 28, 2015, 5:51am

Hi,

Can you please share your input PDF file for further analysis?

Best Regards,

codewarior · October 28, 2015, 6:47am

Hi Srinivasarao,

Thanks for contacting support.

I have tested the scenario using Aspose.Pdf for .NET 10.9.0 with the code shared earlier, and I am unable to notice any issue when using Visual Studio 2010 on Windows 7 (x64). Can you please share whether you are getting the issue with a particular PDF document or whether it appears for all files? If it occurs only for some specific documents, please share the resource files along with some details regarding your working environment. We are sorry for the inconvenience.

kamineni_srinivasara · October 28, 2015, 8:12am

Hi Shahbazv,

Thank you for your reply.

I am attaching the PDF file for your reference. The file was scanned by the scanner after generating a barcode on the PDF.

My working environment is:

Visual Studio 2010, .NET Framework 4.0, Aspose.PDF version 8.3.1.0, Aspose.Barcode 7.3.0.0

I was using the Aspose.Barcode and Aspose.Pdf components to generate the PDFs.

I need to read the barcode image from the file to get the internal text. When I tried directly, without running Adobe Pro's OCR, I was not able to read the barcode. After running the Adobe Pro OCR, I was able to read the barcode's internal text.

So I thought of running OCR on all the files programmatically before reading the barcode image. Can you please tell me whether this procedure is a good approach?

Regards,

K.Srinivasarao.

codewarior · October 29, 2015, 4:22am

Hi Srinivasarao,

Thanks for sharing the details.

I have gone through the PDF document which you shared in your last post and, as per my observations, it's an image PDF containing text and a barcode image. As per my understanding from your last post, you need to read the barcode information from that document. Please note that in order to read the barcode information, you first need to Convert PDF Pages to JPEG Images or Extract Images from the PDF File using Aspose.Pdf for .NET, and then use Aspose.BarCode for .NET for Barcode Recognition.

Furthermore, please note that in my last post I shared the steps to convert a non-searchable (image-based) PDF into a searchable PDF document. Should you have any further queries, please feel free to contact us.
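As a rough sketch of the first step, the snippet below renders every page of a PDF to a JPEG file with the JpegDevice class from Aspose.Pdf.Devices. The input file name, output folder, 300 dpi resolution and quality value are assumptions for illustration, and the barcode recognition itself is left as a comment because the exact Aspose.BarCode reader calls depend on the version you have installed.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

public static class PdfPagesToJpeg
{
    public static void Main()
    {
        // Assumed input file and output folder; adjust to your environment.
        Document pdfDocument = new Document("Input.pdf");
        string outputDir = "pages";
        Directory.CreateDirectory(outputDir);

        // 300 dpi and quality 100 are illustrative values; a higher
        // resolution generally gives barcode recognition more detail.
        Resolution resolution = new Resolution(300);
        JpegDevice jpegDevice = new JpegDevice(resolution, 100);

        // Page indexes in Aspose.Pdf start at 1.
        for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
        {
            string imagePath = Path.Combine(outputDir, "page" + pageNumber + ".jpg");
            using (FileStream imageStream = new FileStream(imagePath, FileMode.Create))
            {
                // Render the page as a JPEG image into the stream.
                jpegDevice.Process(pdfDocument.Pages[pageNumber], imageStream);
            }
            // Next step (not shown): pass imagePath to Aspose.BarCode
            // for recognition, using the reader API of your version.
        }
    }
}
```

Rendering the whole page keeps the barcode in its scanned context. If the barcode is stored as a separate image object inside the PDF, extracting that image directly may give a cleaner input.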