Free Support Forum - aspose.com

OCR on a PDF Page

Hi,

I have a PDF document in which some pages contain text that cannot be extracted with the TextDevice object. I converted such a page to TIFF and ran OCR on the image, which gave me the text I was expecting. Is there a way to know whether a page can only be read by running an OCR engine?

Regards,

Rajeev

Hi Rajeev,


Thanks for your inquiry. You may use the following snippet to check whether a PDF document contains only images (a scanned PDF). Please note that this is the simplest way of identifying image-only PDFs: the snippet looks for the show text operator on each page and, if it finds none on any page, concludes that the document is image-only. In general, other rules for detecting image-only PDFs are possible, and they can be defined using the DOM (i.e. by analyzing the page contents).

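// Requires Aspose.PDF for Java (import com.aspose.pdf.*;).
boolean HasOnlyImages(String filename)
{
    Document doc = new Document(filename);
    OperatorSelector os;
    // Aspose.PDF page indexes are 1-based.
    for (int pageCount = 1; pageCount <= doc.getPages().size(); pageCount++)
    {
        Page page = doc.getPages().get_Item(pageCount);
        // Collect every show text operator in this page's contents.
        os = new OperatorSelector(new Operator.ShowText());
        page.getContents().accept(os);
        // A show text operator means this page is not image-only.
        if (os.getSelected().size() != 0)
            return false;
    }
    return true;
}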
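
Since the original question was about individual pages, here is a rough, library-independent sketch of the same idea applied per page. It does not use the Aspose API: it assumes you already have each page's decoded content stream as a String, and it simply scans for the PDF text-showing operators (Tj, TJ, ' and "). The class and method names (ShowTextScan, hasShowText, pagesNeedingOcr) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Aspose.PDF: scans decoded content-stream text.
final class ShowTextScan {
    // Text-showing operators in PDF content streams: Tj, TJ, ' and ".
    private static final Set<String> SHOW_TEXT = Set.of("Tj", "TJ", "'", "\"");

    private static boolean isDelimiterOrSpace(char c) {
        return c <= ' ' || "()<>[]{}/%".indexOf(c) >= 0;
    }

    /** True if the decoded content stream contains at least one text-showing operator. */
    static boolean hasShowText(String content) {
        int i = 0, n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                // Literal string: skip it, honoring nested parentheses and backslash escapes.
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char s = content.charAt(i);
                    if (s == '\\') i++;          // also skip the escaped character
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                }
            } else if (c == '%') {
                // Comment: skip to the end of the line.
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
            } else if (c == '<') {
                if (i + 1 < n && content.charAt(i + 1) == '<') {
                    i += 2;                      // "<<" opens a dictionary, not a hex string
                } else {
                    while (i < n && content.charAt(i) != '>') i++;
                    i++;                         // step past the closing '>'
                }
            } else if (c == '/') {
                // Name object such as /F1: skip it so a name like /Tj is not taken for an operator.
                i++;
                while (i < n && !isDelimiterOrSpace(content.charAt(i))) i++;
            } else if (isDelimiterOrSpace(c)) {
                i++;
            } else {
                // Number or operator token: compare it against the text-showing operators.
                int start = i;
                while (i < n && !isDelimiterOrSpace(content.charAt(i))) i++;
                if (SHOW_TEXT.contains(content.substring(start, i))) return true;
            }
        }
        return false;
    }

    /** Returns the 1-based numbers of pages whose content has no text-showing operator. */
    static List<Integer> pagesNeedingOcr(List<String> pageContents) {
        List<Integer> result = new ArrayList<>();
        for (int p = 0; p < pageContents.size(); p++) {
            if (!hasShowText(pageContents.get(p))) result.add(p + 1);
        }
        return result;
    }
}
```

Calling pagesNeedingOcr with one content string per page returns the page numbers that look image-only and are therefore OCR candidates. This is only a heuristic: the sketch does not look inside form XObjects, does not handle binary inline image data, and a page that does contain show text operators may still yield unusable text for other reasons.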


Please feel free to contact us for any further assistance.


Best Regards,