How to extract text from a scanned document created as PDF?

vinit.patel · November 12, 2013, 10:40am

Hi,

We have thousands of documents that were scanned to PDF on a Xerox scanner, and we now need to extract the text from them. Is there a way to use either the OCR or the PDF API of Aspose to do this?

Thanks,

Vinit

babar.raza · November 12, 2013, 11:56pm

Hi Vinit,

Thank you for considering Aspose products.

Currently, the Aspose.OCR components (both flavors) can only perform OCR on images; loading PDF files is not supported at the moment. However, you can combine Aspose.PDF for Java and Aspose.OCR for Java to achieve your goal: convert the PDF file to images using Aspose.PDF for Java, then extract the text from each image using the Aspose.OCR for Java API.
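As a rough illustration of that two-step flow, here is a minimal sketch in plain Java. It deliberately does not call any Aspose classes: `PageRasterizer`, `TextRecognizer`, and `ScannedPdfTextExtractor` are hypothetical names introduced only for this example, and the actual PDF-to-image and OCR calls would have to be filled in from the Aspose.PDF for Java and Aspose.OCR for Java documentation.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: renders every page of a PDF file to an image.
// A real implementation would wrap the PDF-to-image conversion of Aspose.PDF for Java.
interface PageRasterizer {
    List<BufferedImage> renderPages(String pdfPath);
}

// Hypothetical interface: runs OCR on a single page image.
// A real implementation would wrap the recognition call of Aspose.OCR for Java.
interface TextRecognizer {
    String recognize(BufferedImage pageImage);
}

// Hypothetical helper that chains the two steps: PDF -> page images -> text.
final class ScannedPdfTextExtractor {
    private final PageRasterizer rasterizer;
    private final TextRecognizer recognizer;

    ScannedPdfTextExtractor(PageRasterizer rasterizer, TextRecognizer recognizer) {
        this.rasterizer = rasterizer;
        this.recognizer = recognizer;
    }

    // Returns the recognized text of each page, in page order.
    List<String> extractText(String pdfPath) {
        List<String> pageTexts = new ArrayList<>();
        for (BufferedImage page : rasterizer.renderPages(pdfPath)) {
            pageTexts.add(recognizer.recognize(page));
        }
        return pageTexts;
    }
}
```

Keeping the two steps behind separate interfaces means the same loop can be run over thousands of files, and either step can be replaced or tested on its own.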

Please feel free to write back in case you face any difficulties.