Has a document been OCRed

caustin · November 4, 2008, 11:34am

Hi,

Given a large number of PDFs, I need to determine which ones contain just a single image and which also contain textual data, i.e. whether they are image-only PDFs or have been OCRed.

I have managed to do this using the PdfExtractor class and its ExtractText/GetText methods to see if there is any text.

However, the problem with this is that if it is a very large PDF with a lot of text, the ExtractText call can take several seconds and ideally I need to be able to do this faster.

So is there any way to extract the text from just the first 1 or 2 pages, or to extract only the first 10 characters? Or is there a way to just return whether any text exists at all?

Or is there some other/better way that I could be doing this?

Thanks for your help,

Chris

codewarior · November 4, 2008, 3:25pm

Hello Chris,

The PdfExtractor class has properties named StartPage and EndPage, which specify the first and last pages of the range from which text/image content is extracted.

You can also use the HasNextPageText() method to check whether more text is available. For more related information, please visit PdfExtractor Members.
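
To illustrate the idea of checking only the first page or two and stopping as soon as some text turns up, here is a rough, library-independent sketch in Python. It is not the PdfExtractor API: has_text_layer, max_pages, min_chars and fake_pages are hypothetical names made up for this example, and the page texts are assumed to come from a lazy source (for instance, a loop that extracts one page at a time with StartPage and EndPage set to the same page number).

```python
from itertools import islice
from typing import Iterable, Iterator, List


def has_text_layer(page_texts: Iterable[str],
                   max_pages: int = 2,
                   min_chars: int = 10) -> bool:
    """Return True once the first `max_pages` pages have yielded at least
    `min_chars` non-whitespace characters, otherwise False.

    `page_texts` should be lazy (e.g. a generator), so pages beyond the
    limit are never extracted at all.
    """
    seen = 0
    # islice stops pulling from the source after `max_pages` items.
    for text in islice(page_texts, max_pages):
        seen += sum(1 for ch in text if not ch.isspace())
        if seen >= min_chars:
            return True  # early exit: enough text found
    return False


def fake_pages(texts: List[str], log: List[int]) -> Iterator[str]:
    """Hypothetical stand-in for per-page extraction.

    Records each page number in `log` at the moment it is "extracted",
    which makes it visible how many pages were actually touched.
    """
    for number, text in enumerate(texts, start=1):
        log.append(number)
        yield text


if __name__ == "__main__":
    touched: List[int] = []
    pages = fake_pages(["", "Hello world, this is OCR text", "page three"], touched)
    print(has_text_layer(pages), touched)
```

The min_chars threshold is only a guess at a sensible default; a small value such as 10 is meant to ignore stray characters, and you may want a different cut-off for your documents.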