Determine if PDF Page is OCR'd using Aspose.PDF for .NET

M_Sowbhagya · February 19, 2020, 1:27pm

Hello,
How can I detect with Aspose whether a page has been OCR'd? I would like to know whether a page contains searchable text, only an image, or OCR'd (hidden) text.
Please let me know how to do this.

Thanks,
Sow

asad.ali · February 19, 2020, 5:46pm

@M_Sowbhagya

We need to investigate this requirement, and for that we need a sample PDF document from you. Could you please share one so that we can proceed?

M_Sowbhagya · February 20, 2020, 6:58am

Thanks for the reply. Test.pdf (2.2 MB)
I have uploaded a test file that contains scanned, OCR'd, and searchable-text pages. I would like to know how I can find out whether a page is OCR'd, scanned, or searchable text using Aspose.

Thanks,
Sow

asad.ali · February 20, 2020, 5:22pm

@M_Sowbhagya

Thanks for sharing sample PDF.

Currently, Aspose.PDF can detect whether a PDF contains text or only images. However, functionality to detect the presence of hidden text is not available at the moment. We have logged a feature request as PDFNET-47746 in our issue tracking system. We will investigate the feasibility of the feature and keep you posted on its availability. Please be patient and spare us some time.

We are sorry for the inconvenience.

asad.ali · April 27, 2020, 6:07pm

@M_Sowbhagya

We have investigated the issue. Detection of invisible text is already implemented: use the TextFragmentAbsorber class together with the TextState.RenderingMode property.
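
The approach comes down to three per-page flags: whether the page has images, whether it has visible text, and whether any image has invisible text placed over it. As a rough sketch of the decision step only, with no Aspose dependency, the mapping from flags to a verdict could look like the following. The names `PageKind` and `PageClassifier.Classify` are hypothetical and used only for illustration; they are not part of Aspose.PDF.

```csharp
// Hypothetical helper types for illustration; not part of Aspose.PDF.
public enum PageKind
{
    LooksOcrd,   // invisible text found over at least one image
    ImagesOnly,  // raster images, no visible text
    TextOnly,    // visible text, no raster images
    Mixed,       // both visible text and raster images
    Empty        // neither (may be blank or vector graphics only)
}

public static class PageClassifier
{
    // Maps the three per-page flags to a verdict.
    // Invisible text over an image takes priority over everything else.
    public static PageKind Classify(bool hasImages, bool hasVisibleText, bool hasInvisibleTextOnImages)
    {
        if (hasInvisibleTextOnImages) return PageKind.LooksOcrd;
        if (hasImages && hasVisibleText) return PageKind.Mixed;
        if (hasImages) return PageKind.ImagesOnly;
        if (hasVisibleText) return PageKind.TextOnly;
        return PageKind.Empty;
    }
}
```

Returning a value instead of printing makes it easier to collect per-page results or filter pages afterwards. The verdict is a heuristic: invisible text over an image suggests an OCR layer but does not prove one.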

Please consider the following code:

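using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that holds Test.pdf (define it before running this snippet)
Document doc = new Document(dataDir + @"Test.pdf");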
foreach (Page page in doc.Pages)
{
    bool isContainImages = false;
    bool isContainVisibleText = false;
    bool isContainInvisibleTextUnderImages = false;

    //Search for visible text
    TextFragmentAbsorber allTextAbsorber = new TextFragmentAbsorber();
    page.Accept(allTextAbsorber);
    foreach (TextFragment textFragment in allTextAbsorber.TextFragments)
    {
        // Non-blank text whose rendering mode is anything other than Invisible counts as visible
        if (!String.IsNullOrWhiteSpace(textFragment.Text)
            && textFragment.TextState.RenderingMode != TextRenderingMode.Invisible)
        {
            isContainVisibleText = true;
            Console.WriteLine("Page {0} contains visible text", page.Number);
            break;
        }
    }
    if (!isContainVisibleText)
        Console.WriteLine("Page {0} contains no visible text", page.Number);

    //Search for images and invisible text
    ImagePlacementAbsorber imageAbsorber = new ImagePlacementAbsorber();
    imageAbsorber.Visit(page);
    if (imageAbsorber.ImagePlacements.Count > 0)
        isContainImages = true;
    Console.WriteLine("Page {0} contains {1} images", page.Number, imageAbsorber.ImagePlacements.Count);
    foreach (ImagePlacement imagePlacement in imageAbsorber.ImagePlacements)
    {
        TextSearchOptions options = new TextSearchOptions(imagePlacement.Rectangle);
        TextFragmentAbsorber onImageTextAbsorber = new TextFragmentAbsorber();
        onImageTextAbsorber.TextSearchOptions = options;
        page.Accept(onImageTextAbsorber);
        foreach (TextFragment onImageTextFragment in onImageTextAbsorber.TextFragments)
        {
            if (!String.IsNullOrEmpty(onImageTextFragment.Text)
                && onImageTextFragment.TextState.RenderingMode == TextRenderingMode.Invisible)
            {
                isContainInvisibleTextUnderImages = true;
                Console.WriteLine("Page {0} contains invisible text on the image at {1}", page.Number, imagePlacement.Rectangle);
                break;
            }
        }
    }

    if (isContainInvisibleTextUnderImages)
        Console.WriteLine("Page {0} looks OCR'd!", page.Number);
    else if (isContainImages && !isContainVisibleText)
        Console.WriteLine("Page {0} contains image(s) only!", page.Number);
    else if (!isContainImages && isContainVisibleText)
        Console.WriteLine("Page {0} contains common text only!", page.Number);
    else if (isContainImages && isContainVisibleText)
        Console.WriteLine("Page {0} contains mixed content (common text and images)!", page.Number);
    else
        Console.WriteLine("Page {0} contains neither common text nor raster images (may be empty or contain vector graphics)!", page.Number);

    Console.WriteLine("---");