How to OCR a PDF file so that users can select text

Laksh · July 22, 2016, 5:02pm

I have a PDF file in which every page is an image. (Our client receives a document by postal mail and scans it, and the scanner produces a PDF, which the client then provides to us. So although it’s a PDF file, every page is really an image.) I want to give users copy-and-paste functionality: select text in the PDF, copy it, and paste it elsewhere. However, because each page is an image, the user cannot select any text.

I think I have to OCR the page, but I don’t want to extract all the text from the page. I just want to allow the user to select text and then copy and paste the selection.

What are my options using Aspose APIs?

I have attached a sample PDF page.

ikram.haq · July 25, 2016, 1:46am

Hi Laxmikant,

Thank you for your inquiry and for providing the sample file.

The Aspose.OCR API accepts only images as input. To perform OCR on a PDF document, two Aspose APIs are needed: the Aspose.Pdf API to convert the PDF pages to images, and the Aspose.OCR API to perform OCR on the converted images. For details on how to perform OCR on a PDF document, please see the article Performing OCR on PDF Documents.
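The two-step flow described above (render each page to an image, then OCR the image) can be sketched in a library-neutral way. This is only an illustration in Python: `render_page` and `recognize` are hypothetical callables standing in for whatever PDF renderer and OCR engine are actually used. They are not Aspose API names.

```python
from typing import Callable, Dict


def ocr_scanned_pdf(
    page_count: int,
    render_page: Callable[[int], bytes],
    recognize: Callable[[bytes], str],
) -> Dict[int, str]:
    """Return the recognized text for each 1-based page number.

    render_page and recognize are hypothetical placeholders:
    render_page(n) should return page n as image bytes, and
    recognize(image) should return the text found in that image.
    """
    text_by_page: Dict[int, str] = {}
    for page_number in range(1, page_count + 1):
        image_bytes = render_page(page_number)  # step 1: PDF page -> image
        text_by_page[page_number] = recognize(image_bytes)  # step 2: image -> text
    return text_by_page
```

In a real project the two callables would wrap the PDF library's page-to-image conversion and the OCR library's recognition call.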

Once you have the image, you can use a Recognition Block to extract text from a particular area of the image. Please visit the following links for details:

I hope the above information helps. Feel free to contact us if you have any questions or comments.

Laksh · July 25, 2016, 10:13am

If you check the attached PDF, you won't be able to select and copy any text.

So I am not looking to extract text automatically. I want the user to be able to select text in the PDF and then right-click -> Copy -> Paste.

How do I do that?

ikram.haq · July 25, 2016, 1:42pm

Hi Laxmikant,

Thank you for writing back.

The functionality/tool you are looking for is not available. As a workaround, as described in my last post, you can extract the page images from the PDF document using the Aspose.Pdf API. Once you have the image, select part of it as a rectangle (x, y, width, height) and then use a Custom Recognition Block to perform OCR on that area.
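If the user's selection is captured in PDF page coordinates, it has to be converted to a pixel rectangle on the rendered image before it can be passed to a recognition block. Below is a minimal sketch of that conversion in Python. It assumes the selection is given in PDF points (1/72 inch) with the origin at the bottom-left of the page, and that the image was rendered at a known DPI with the origin at the top-left. The function and type names are made up for illustration and are not part of any Aspose API.

```python
from typing import NamedTuple


class PixelRect(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def pdf_rect_to_pixels(
    x_pt: float,
    y_pt: float,
    width_pt: float,
    height_pt: float,
    page_height_pt: float,
    dpi: int = 300,
) -> PixelRect:
    """Convert a selection in PDF points to a pixel rectangle.

    Assumes (x_pt, y_pt) is the lower-left corner of the selection in a
    bottom-left-origin PDF coordinate system, and that the target image
    uses a top-left origin.
    """
    scale = dpi / 72.0  # pixels per PDF point
    # Flip the vertical axis: distance from the top of the page to the
    # top edge of the selection.
    top_pt = page_height_pt - (y_pt + height_pt)
    return PixelRect(
        x=round(x_pt * scale),
        y=round(top_pt * scale),
        width=round(width_pt * scale),
        height=round(height_pt * scale),
    )
```

For example, on a US Letter page (792 points tall) rendered at 144 DPI, `pdf_rect_to_pixels(72, 72, 144, 36, 792, dpi=144)` gives `PixelRect(x=144, y=1368, width=288, height=72)`.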

muhammad.ijaz · July 26, 2016, 12:50am

Hi Laxmikant,

Adding to Ikram's comments: what you want is to convert a scanned PDF to a searchable PDF. Please see https://forum.aspose.com/t/9544 for details on how to do this.

You can use Tesseract as well as Aspose.OCR to extract the OCR text from images in the code mentioned above.
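For the Tesseract route, one possible approach is to let the Tesseract command-line tool write a searchable PDF directly from a page image: the `pdf` output option produces a PDF that shows the scanned image with an invisible, selectable text layer on top. The following is a rough Python sketch. It assumes the `tesseract` executable is installed and on the PATH, and file names such as `page1.png` are placeholders. It is not the code from the linked thread.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_pdf_command(
    image_path: str, output_base: str, lang: str = "eng"
) -> List[str]:
    """Build the command line: tesseract <image> <output_base> -l <lang> pdf."""
    # The trailing "pdf" tells Tesseract to write <output_base>.pdf
    # containing the image plus a hidden text layer.
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def make_searchable_page(
    image_path: str, output_base: str, lang: str = "eng"
) -> str:
    """Run Tesseract on one page image and return the resulting PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_pdf_command(image_path, output_base, lang),
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return output_base + ".pdf"
```

Calling `make_searchable_page("page1.png", "page1")` would write `page1.pdf`. The per-page PDFs can then be merged back into a single document with a PDF library.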

Best Regards,

Laksh · August 2, 2016, 4:28pm

Hi

I have gone through the link you provided; however, I'm getting a "Could not find file 'c:\pdftest\out.html'" error. I have posted my questions here

ikram.haq · August 3, 2016, 11:57am

Hi Laxmikant,

Please see the thread in the Aspose.Pdf support forum for details. My colleague has replied to your inquiry there.