Quick detection of a text layer inside the PDF

groupdocs · May 12, 2015, 2:58am

Hello there,

I have the following question. Some PDFs contain no text layer, only images and graphical glyphs. Other PDFs contain a text layer that can be selected, copied, and searched. I know that Aspose.Pdf can detect and extract such text content with the "TextFragmentAbsorber" class. However, for very large PDFs this takes a long time: you have to loop over every page of the document and apply a "TextFragmentAbsorber" instance to each page, trying to grab any text.

So I'm wondering whether there is a quick way to check if a particular PDF file contains any textual content at all. If there is, it would let me skip the page-by-page scan with "TextFragmentAbsorber" whenever the PDF has no text content.

Thanks in advance.

With best regards, Denis Gvardionov

tilal.ahmad · May 12, 2015, 11:59am

Hi Denis,

Thanks for your inquiry. Please check the following code snippet, which detects whether a PDF document contains only images. Hopefully, it will help you devise your own logic.

Also, please note that we have supplied the simplest way of identifying image-only PDFs. The snippet looks for a show-text operator on each page and treats the document as image-only if none is found. In general, there can be other rules for detecting image-only PDFs, and these can be defined using the DOM (i.e., by analyzing the page content).

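// Requires: using Aspose.Pdf;
// Returns true when no page contains a show-text operator.
bool HasOnlyImages(string filename)
{
    Document document = new Document(filename);
    foreach (Page page in document.Pages)
    {
        // Select every show-text operator in this page's content stream.
        OperatorSelector os = new OperatorSelector(new Operator.ShowText());
        page.Contents.Accept(os);

        // The first hit proves the document has text, so stop early.
        if (os.Selected.Count != 0)
            return false;
    }
    return true;
}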
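
As a rough sketch of how this check could be combined with the extraction you already have, the method below runs HasOnlyImages first and only falls back to the page-by-page "TextFragmentAbsorber" pass when text was found. The method name ExtractTextIfPresent and the console output are hypothetical and only for illustration. It assumes the method sits in the same class as HasOnlyImages.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical caller (not part of Aspose.Pdf): run the cheap operator check
// first and skip extraction entirely for image-only documents.
void ExtractTextIfPresent(string filename)
{
    if (HasOnlyImages(filename))
    {
        Console.WriteLine("No show-text operators found; skipping text extraction.");
        return;
    }

    // Text was detected, so the slower page-by-page extraction is worthwhile.
    Document document = new Document(filename);
    foreach (Page page in document.Pages)
    {
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        page.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
            Console.WriteLine(fragment.Text);
    }
}
```

Note that this sketch opens the file twice, once for the check and once for the extraction. If that matters for your documents, the same check could be applied to an already loaded Document instead.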

Best Regards,