I have a PDF file in which every page is an image. (Our client receives a document by postal mail and scans it, and the scanner produces a PDF. The client then provides this PDF to us. Even though it is a PDF file, every page of it is in reality an image.) I want to provide copy-and-paste functionality so that a user can select text in the PDF, copy it, and paste it elsewhere. However, since each page is an image, the user cannot select any text.
I think I have to OCR the page, but I don’t want to extract all the text from the page. I just want to allow the user to select text and then copy and paste the selected text.
Thank you for your inquiry and for providing the sample file.
The Aspose.OCR API accepts only images as input for OCR. To perform OCR on a PDF document, you need two Aspose APIs: the Aspose.Pdf API to convert the PDF pages to images, and the Aspose.OCR API to perform OCR on the converted images. For details on how to perform OCR on a PDF document, please see the article Performing OCR on PDF Documents.
Once you have the image, you can use a Recognition Block to extract information from a particular area of the image. Please visit the following links for details:
This is to update you that the functionality/tool you are looking for is not available. As described in my last post, you can achieve this by first converting the PDF pages to images using the Aspose.Pdf API. Once you have the image, select a part of it as a rectangle (x, y, width, height) and then use a Custom Recognition Block to perform OCR on that area only.
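To illustrate the rectangle step, here is a minimal Python sketch of how a selection made on the PDF page could be mapped to a pixel rectangle (x, y, width, height) on the rendered image. It is not Aspose code. The function name `selection_to_pixel_rect` and the `PixelRect` type are hypothetical, and the sketch assumes the default PDF coordinate system (origin at the bottom-left, 72 points per inch) with no page rotation or crop offset.

```python
from typing import NamedTuple

POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


class PixelRect(NamedTuple):
    """Rectangle in image pixels, origin at the top-left corner."""
    x: int
    y: int
    width: int
    height: int


def selection_to_pixel_rect(x_pt, y_pt, width_pt, height_pt, page_height_pt, dpi):
    """Map a selection given in PDF points to pixels on the rendered page image.

    (x_pt, y_pt) is the bottom-left corner of the selection in PDF points.
    Assumption: the page was rendered at `dpi` without rotation or cropping.
    """
    scale = dpi / POINTS_PER_INCH
    # PDF y grows upwards, image y grows downwards, so flip the vertical axis.
    top_pt = page_height_pt - (y_pt + height_pt)
    return PixelRect(
        x=round(x_pt * scale),
        y=round(top_pt * scale),
        width=round(width_pt * scale),
        height=round(height_pt * scale),
    )
```

The resulting rectangle is what you would hand to the recognition block, so that only the selected area is recognized rather than the whole page.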
Adding to Ikram's comments: what you want is to convert a scanned PDF to a searchable PDF. Please check https://forum.aspose.com/t/9544 for more details on how to do this.
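As a rough illustration of why a searchable PDF solves the selection problem: the OCR step yields words together with their positions on the page, and a selection then simply picks the words whose boxes overlap the selected area. The Python sketch below shows only that idea. The `Word` type and both function names are hypothetical, and ordering words by their top and left coordinates is a simplification that assumes the words of a line share the same top coordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and its bounding box in image pixels."""
    text: str
    x: int
    y: int
    width: int
    height: int


def intersects(word, sel_x, sel_y, sel_w, sel_h):
    """Return True if the word's box overlaps the selection rectangle."""
    return (
        word.x < sel_x + sel_w
        and sel_x < word.x + word.width
        and word.y < sel_y + sel_h
        and sel_y < word.y + word.height
    )


def selected_text(words, sel_x, sel_y, sel_w, sel_h):
    """Join the words that overlap the selection, in reading order."""
    hits = [w for w in words if intersects(w, sel_x, sel_y, sel_w, sel_h)]
    # Simplified reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in hits)
```

With this approach the page is recognized once, and every later selection is only a lookup over the stored word boxes.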
You can use Tesseract as well as Aspose.OCR to extract the OCR text from the images in the code mentioned above.
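If you go the Tesseract route, the command-line tool can write its result as tab-separated values that include a bounding box for each word (for example `tesseract page.png out tsv`). The Python sketch below reads such output using only the standard library. The function name `words_from_tesseract_tsv` is hypothetical, and the column names (`level`, `left`, `top`, `width`, `height`, `text`) are assumed to match the TSV header written by your Tesseract version, so please verify them against your own output.

```python
import csv
import io

WORD_LEVEL = "5"  # assumption: Tesseract marks word rows with level 5


def words_from_tesseract_tsv(tsv_text):
    """Parse Tesseract TSV output into (text, left, top, width, height) tuples.

    Rows for pages, blocks, paragraphs and lines are skipped, as are
    word rows whose text is empty.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # recognized text may contain quote characters
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != WORD_LEVEL or not text:
            continue
        words.append(
            (
                text,
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
            )
        )
    return words
```

The word boxes returned here are the kind of information needed to let a user copy only the part of the page that was selected.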