Given a large number of PDFs, I need to determine which ones contain only a scanned image and which also contain textual data, i.e. whether they are image-only PDFs or have been OCRed.
I have managed to do this using the PdfExtractor class and its ExtractText/GetText methods to see if there is any text.
However, if the PDF is very large and contains a lot of text, the ExtractText call can take several seconds, and ideally I need this check to be faster.
So is there any way to extract the text from just the first 1 or 2 pages, or to extract only the first 10 characters? Or is there a way to just return whether any text exists at all?
Or is there some other/better way that I could be doing this?
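To illustrate the kind of cheap "does any text exist at all" check I have in mind, here is a rough sketch in plain Python that does not use PdfExtractor at all. It relies on an assumption: text drawn on a page normally needs a font resource, so a file whose raw bytes never mention `/Font` is probably image-only. The function name `mentions_font` and the chunk size are made up for this example. The check is only a heuristic: PDFs that store their dictionaries in compressed object streams can hide the key, and a font can be declared without any text being drawn, so it could at best pre-filter files before a real extraction.

```python
import io
from typing import BinaryIO

FONT_KEY = b"/Font"  # name used for font resources in PDF dictionaries


def mentions_font(stream: BinaryIO, chunk_size: int = 1 << 16) -> bool:
    """Return True as soon as the raw bytes contain /Font, else False.

    Heuristic only: compressed object streams can hide the key.
    """
    tail = b""  # last few bytes seen, so a key split across chunks is found
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False  # reached end of file without seeing the key
        window = tail + chunk
        if FONT_KEY in window:
            return True  # stop early, no need to read the rest
        tail = window[-(len(FONT_KEY) - 1):]


# Example with in-memory data; for a real file use open(path, "rb").
sample = io.BytesIO(b"%PDF-1.4 << /Resources << /Font << /F1 5 0 R >> >> >>")
print(mentions_font(sample))
```

Because the scan stops at the first match and never parses the document, its cost should depend on file size rather than on how much text the PDF holds.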
Thanks for your help,