Quick detection of a text layer inside the PDF

Hello there,

I have the following question. Some PDFs contain no text layer, only images and graphical glyphs. Other PDFs contain a text layer that can be selected, copied, and searched. I know that Aspose.Pdf can detect and extract such text content with the "TextFragmentAbsorber" class. However, for very large PDFs this takes a long time: you need to iterate over every page of the document in a loop and apply a "TextFragmentAbsorber" instance to each page, trying to find any text.


So I'm wondering: is there a quick way to check whether a particular PDF file contains any text content at all? If so, it would let me skip the page-by-page scan with "TextFragmentAbsorber" when the PDF has no text content.

Thanks in advance.

With best regards, Denis Gvardionov

Hi Denis,

Thanks for your inquiry. Please check the following code snippet, which detects whether a PDF document contains only images. Hopefully it will help you devise your own logic.

Also, please note that we’ve supplied the simplest way of identifying image-only PDFs. The proposed snippet looks for show-text operators in each page's content stream and treats the document as image-only when none are found. In general, there can be other rules for detecting image-only PDFs, and these can be defined using the DOM (i.e., by analyzing the page content).

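using Aspose.Pdf;

// Returns true when no page content stream contains a show-text operator.
bool HasOnlyImages(string filename)
{
    // Document is IDisposable, so release the file when the check is done.
    using (Document document = new Document(filename))
    {
        foreach (Page page in document.Pages)
        {
            // Collect the show-text operators from this page's content stream.
            // If Operator.ShowText does not resolve in your Aspose.Pdf version,
            // look for the class in the Aspose.Pdf.Operators namespace instead.
            OperatorSelector os = new OperatorSelector(new Operator.ShowText());
            page.Contents.Accept(os);
            // One hit is enough: the page draws text, so stop scanning.
            if (os.Selected.Count != 0)
                return false;
        }
    }
    return true;
}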
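
As a rough sketch of how this check could gate the expensive extraction, the example below opens the document once, runs the operator check, and only then applies "TextFragmentAbsorber". The names TextLayerProbe, HasShowTextOperator, and ExtractTextIfAny and the "input.pdf" path are made up for illustration, and the Operator.ShowText spelling is assumed to match the snippet above (it may differ in your Aspose.Pdf version). The check looks only for the show-text operator, so a document that draws text with other text-showing operators could be reported as having no text; treat a negative result as a hint rather than a guarantee.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class TextLayerProbe // hypothetical helper class, not part of Aspose.Pdf
{
    // Same idea as HasOnlyImages above, but on an already opened document
    // so the file is not loaded twice.
    static bool HasShowTextOperator(Document document)
    {
        foreach (Page page in document.Pages)
        {
            // Assumption: Operator.ShowText is available as in the snippet above.
            OperatorSelector selector = new OperatorSelector(new Operator.ShowText());
            page.Contents.Accept(selector);
            if (selector.Selected.Count != 0)
                return true; // first hit is enough
        }
        return false;
    }

    // Returns the text of every fragment, or an empty list when the
    // quick check finds no show-text operator on any page.
    static List<string> ExtractTextIfAny(string filename)
    {
        List<string> fragments = new List<string>();
        using (Document document = new Document(filename))
        {
            if (!HasShowTextOperator(document))
                return fragments; // skip the expensive absorber pass

            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            document.Pages.Accept(absorber);
            foreach (TextFragment fragment in absorber.TextFragments)
                fragments.Add(fragment.Text);
        }
        return fragments;
    }

    static void Main(string[] args)
    {
        // "input.pdf" is a placeholder path for illustration only.
        string path = args.Length > 0 ? args[0] : "input.pdf";
        foreach (string text in ExtractTextIfAny(path))
            Console.WriteLine(text);
    }
}
```

The operator check still visits each page, but it only inspects the content stream operators and returns at the first match, so it should be noticeably cheaper than running the absorber over a large image-only document.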

Best Regards,