OCRing PDF files on the fly

korgman · April 2, 2015, 6:40am

I’ve built a search engine around Lucene, but we have many thousands of PDFs which are just scans with no text layer. Can you please tell me how I can OCR these ‘on the fly’? Looking at Aspose.OCR, it appears to process only images, not files like PDF.

Can you advise whether Aspose.PDF and Aspose.OCR can be made to work together to process PDFs on the fly, or is there a better solution?

babar.raza · April 2, 2015, 12:13pm

Hi Barry,

Thank you for contacting Aspose support.

That is correct: Aspose.OCR APIs can perform OCR on images only. You can therefore either extract the images from the PDF or convert the PDF pages to images before feeding them to the OcrEngine. Please note that Aspose.OCR APIs work well with high-resolution images of at least 300 DPI. If your PDF contains images at the recommended resolution, you can simply extract them; otherwise, you have to set the recommended resolution while converting the PDF pages to images.
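
The choice between extracting and re-rendering comes down to simple arithmetic, which can be sketched as follows. This is only an illustration and does not call any Aspose API: the function names (effective_dpi, render_size, choose_strategy) are hypothetical, and it assumes the default PDF unit of 1 point = 1/72 inch and that you already know each image's pixel size and the size at which it is placed on the page.

```python
import math

POINTS_PER_INCH = 72   # default PDF user-space unit: 1 point = 1/72 inch
TARGET_DPI = 300       # minimum resolution recommended in the reply above


def effective_dpi(pixel_width, pixel_height, placed_width_pt, placed_height_pt):
    """Resolution of an embedded image at the size it is drawn on the page."""
    dpi_x = pixel_width / (placed_width_pt / POINTS_PER_INCH)
    dpi_y = pixel_height / (placed_height_pt / POINTS_PER_INCH)
    # The weaker axis decides whether the image is good enough for OCR.
    return min(dpi_x, dpi_y)


def render_size(page_width_pt, page_height_pt, dpi=TARGET_DPI):
    """Pixel dimensions needed to rasterize a page at the given DPI."""
    width_px = math.ceil(page_width_pt * dpi / POINTS_PER_INCH)
    height_px = math.ceil(page_height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


def choose_strategy(pixel_width, pixel_height, placed_width_pt, placed_height_pt):
    """Return 'extract' if the embedded scan already meets TARGET_DPI, else 'render'."""
    dpi = effective_dpi(pixel_width, pixel_height, placed_width_pt, placed_height_pt)
    return "extract" if dpi >= TARGET_DPI else "render"


if __name__ == "__main__":
    # A US Letter page is 612 x 792 points (8.5 x 11 inches).
    print(render_size(612, 792))
    print(choose_strategy(1700, 2200, 612, 792))
```

For a full-page scan stored at 1700 x 2200 pixels on a US Letter page, the effective resolution is 200 DPI, so the page would need to be re-rendered at 300 DPI rather than having its image extracted as-is.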

Please feel free to contact us again if you face any difficulty.