Has a document been OCRed?

Hi,

Given a large number of PDFs, I need to determine which ones contain just a single image and which also contain textual data, i.e. whether they are image-only PDFs or have been OCRed.

I have managed to do this using the PdfExtractor class and its ExtractText/GetText methods to see if there is any text.

However, the problem is that for a very large PDF with a lot of text, the ExtractText call can take several seconds, and ideally I need this to be faster.

So is there any way to extract the text from just the first 1 or 2 pages, or to extract only the first 10 characters? Or is there a way to just return whether any text exists at all?

Or is there some other/better way that I could be doing this?

Thanks for your help,

Chris

Hello Chris,

The PdfExtractor class has properties named StartPage and EndPage, which specify the first and last page of the range from which text/image contents are extracted.

You can also use the HasNextPageText() method to check whether more text is available. For more information, please see PdfExtractor Members.
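
To illustrate the idea of limiting the check to the first page or two, here is a minimal sketch in Java. It does not call the PdfExtractor API at all: `TextLayerProbe` and its `pageText` callback are hypothetical names, and the callback is only a stand-in for whatever per-page extraction you wire up (for example, by setting StartPage and EndPage to the same page before extracting). The sketch shows only the early-exit logic: inspect at most a few leading pages and stop as soon as one of them yields non-whitespace text.

```java
import java.util.function.IntFunction;

// Hypothetical helper, not part of any PDF library: decides whether a
// document appears to have a text layer by sampling its leading pages.
final class TextLayerProbe {
    private TextLayerProbe() {}

    /**
     * @param pageText  returns the extracted text for a 1-based page number
     *                  (assumed stand-in for a real per-page extraction call)
     * @param pageCount total number of pages in the document
     * @param maxPages  maximum number of leading pages to inspect
     * @return true if any inspected page contains non-whitespace text
     */
    static boolean hasText(IntFunction<String> pageText, int pageCount, int maxPages) {
        int last = Math.min(pageCount, maxPages);
        for (int page = 1; page <= last; page++) {
            String text = pageText.apply(page);
            if (text != null && !text.trim().isEmpty()) {
                return true; // stop at the first page that has real text
            }
        }
        return false; // nothing found in the sampled pages
    }
}
```

Note the trade-off: a document whose first pages are pure images (a scanned cover, for instance) but whose later pages contain text would be reported as image-only, so the page limit is worth tuning to your documents.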