How to read text from a scanned PDF

Hi,

Currently I am reading barcodes from a scanned PDF using the Aspose.BarCode library.
My challenge is that I need to search for a specific word in the PDF first, and scan the barcodes only if that word exists. How can I search for or read text in a scanned PDF?

Please suggest.

Thanks,
Manjula

@manjularani,

Thanks for your query.

To search for a specific word or text, you need to use the Aspose.PDF API; see the documentation on how to search text in PDF document pages for reference. So you can use Aspose.PDF to search the text in the PDF document first. To read barcodes from the PDF, you need to integrate with the Aspose.PDF for Java API, which extracts each image so that Aspose.BarCode for Java can read it. See the documentation, with example code, on how to read barcodes from a PDF document.
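The "search first, scan only on a match" flow can be sketched roughly as follows. This is only an illustration of the control flow, not Aspose code: the class `KeywordGatedScan` and both function parameters are hypothetical placeholders, and you would plug in whatever text extraction and barcode reading calls you actually use.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper for illustration; none of these names come from an Aspose API.
final class KeywordGatedScan {

    /** Case-insensitive substring check on text already extracted from the PDF. */
    static boolean containsKeyword(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return false;
        }
        return text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    /**
     * Runs the barcode reader only when the extracted text contains the keyword.
     * textExtractor and barcodeReader are placeholders for the real library calls.
     */
    static <D> List<String> scanIfKeywordPresent(
            D document,
            String keyword,
            Function<D, String> textExtractor,
            Function<D, List<String>> barcodeReader) {
        String text = textExtractor.apply(document);
        if (!containsKeyword(text, keyword)) {
            return Collections.emptyList(); // keyword absent: skip the barcode pass
        }
        return barcodeReader.apply(document);
    }
}
```

Keeping the two steps behind function parameters means the gate itself does not care which library does the text search or the barcode reading.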

Hi Amjad,

I would like to avoid the Aspose.PDF library. Is there any alternative that does not use it?

@manjularani,

No, this is not possible without using Aspose.PDF or otherwise extracting the images from the PDF. Aspose.BarCode only processes image formats; see the supported image formats for reference.

You still need some library to work with PDF files. You can search Google for something like “java pdf library” and pick any one you like.

Or you can study the PDF specification (PDF 1.7), open the file, manually find the text tags, decode them with the correct character encoding, and search the text.
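To give an idea of what the manual route involves, here is a rough sketch that collects the literal strings from a page content stream, such as the `(Hello)` in `(Hello) Tj`. It rests on big assumptions: the stream is already decompressed (most real streams are Flate-compressed), the strings use a simple single-byte encoding, and hex strings such as `<48656C6C6F>` are ignored. The class name `ContentStreamText` is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; assumes an already decoded content stream.
final class ContentStreamText {

    /** Collects PDF literal strings, handling nested parentheses and backslash escapes. */
    static List<String> literalStrings(String contentStream) {
        List<String> out = new ArrayList<>();
        int n = contentStream.length();
        int i = 0;
        while (i < n) {
            if (contentStream.charAt(i) != '(') {
                i++;
                continue;
            }
            StringBuilder sb = new StringBuilder();
            int depth = 1;
            i++; // skip the opening parenthesis
            while (i < n && depth > 0) {
                char c = contentStream.charAt(i);
                if (c == '\\' && i + 1 < n) {
                    // Simplification: take the escaped character literally.
                    // This covers \( \) and \\ but does not decode \n or octal escapes.
                    sb.append(contentStream.charAt(i + 1));
                    i += 2;
                    continue;
                }
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                    if (depth == 0) {
                        i++; // consume the closing parenthesis
                        break;
                    }
                }
                sb.append(c);
                i++;
            }
            out.add(sb.toString());
        }
        return out;
    }

    /** Joins all literal strings and checks whether the word occurs in the result. */
    static boolean containsWord(String contentStream, String word) {
        return String.join("", literalStrings(contentStream)).contains(word);
    }
}
```

The strings are joined without separators because a single word is often split across several operands, for example `[(In) -20 (voice)] TJ`. Spacing between words is therefore not reliable in this sketch, which is one reason a proper PDF library is the easier path.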

But the easy way to do this is simply to use Aspose.PDF.

However, if by “read text in scanned PDF” you mean “read text from the image”, you need to run text recognition on the image. For this you can use Aspose.OCR or Tesseract OCR.
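For the Tesseract option, one way to call it from Java is through its command-line tool. The sketch below assumes the `tesseract` executable is installed and on the PATH, that the page has already been saved as an image file, and that Java 9 or later is used (for `readAllBytes`). The class `TesseractCli` is a made-up name for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper for illustration; assumes the tesseract CLI is installed.
final class TesseractCli {

    /** Builds the command line: tesseract <image> stdout -l <language>. */
    static List<String> buildCommand(String imagePath, String language) {
        // "stdout" as the output base makes tesseract print the text instead of writing a file.
        return Arrays.asList("tesseract", imagePath, "stdout", "-l", language);
    }

    /** Runs tesseract on one image and returns the recognized text. */
    static String recognize(String imagePath, String language)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(imagePath, language));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // show tesseract diagnostics on the console
        Process process = pb.start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

A call such as `TesseractCli.recognize("page1.png", "eng")` would return the page text, which can then be searched for the required word before any barcode scanning is done.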

I tried Tesseract OCR, but it returns only junk characters instead of the proper text.

@manjularani,

You may try Aspose.OCR to extract text from images. Otherwise, you may search the internet for a tool that suits your needs.