PDF Text and Image segregation

vghodekar · November 12, 2024, 3:28pm

I've encountered another issue with PDF processing. It happens only for certain documents, not all.

The code below first extracts the text and then the images. For certain PDFs, the exact same page content is extracted once as text and then again as an image. I examined the PDF, and the pages are text (searchable). Because we run OCR on the extracted images, we end up with duplicate content.

Any guidance or insight will be helpful.
============== code snippet ==============
foreach (var pdfPage in pdfDocument.Pages)
{
    StringBuilder pageTextSB = new StringBuilder();

    if (pdfPage == null) { continue; }

    pageNo++;

    TextAbsorber textAbsorber = new TextAbsorber();
    pdfPage.Accept(textAbsorber);

    var paragraphs = new List<string>();

    var pageText = textAbsorber.Text;
    .....
    .....

    var xImages = pdfPage.Resources.Images.ToList();

    foreach (var xI in xImages)
    {
        // OCR is run on each image here
    }
}

Professionalize.Discourse · November 12, 2024, 3:28pm

@vghodekar

It sounds like the same content is being extracted as both text and images from certain PDF documents, which produces duplicates once OCR is run on the images. This can happen because of how some PDFs are structured. One common case is a page stored as a scanned image with a searchable text layer on top, so the same content exists in both forms.

To address this issue, consider the following approaches:

Check for Text Extraction: Before running OCR on the images, confirm that TextAbsorber is already capturing the page's text, as in your current code. Log or inspect the extracted text for the affected pages to verify this.
Image Extraction Logic: When processing images, check whether an image contains text that has already been captured. If the text has already been extracted, you can skip OCR for that image, which helps prevent duplication.
Use of Tagged PDFs: If the PDFs are tagged, you can use the structure information they carry. This can improve text extraction accuracy and reduce the chance of extracting the same content as both text and image.
Filtering Duplicates: After extracting both text and images, filter duplicates out of the final output by comparing the extracted text with the OCR output for each image.

Here’s a simplified code snippet to illustrate how you might implement a check for duplicates:

foreach (var pdfPage in pdfDocument.Pages)
{
    if (pdfPage == null) { continue; }

    // Extract the page's text layer
    TextAbsorber textAbsorber = new TextAbsorber();
    pdfPage.Accept(textAbsorber);
    var pageText = textAbsorber.Text;

    // Extract images
    var xImages = pdfPage.Resources.Images.ToList<XImage>();
    foreach (var xI in xImages)
    {
        // Perform OCR on the image (PerformOCR stands in for your own OCR call)
        var ocrText = PerformOCR(xI);

        // Check for duplicates. An exact match is strict: OCR output rarely
        // matches the text layer character for character.
        if (!pageText.Contains(ocrText))
        {
            // Add ocrText to your results
        }
    }
}
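
An exact substring comparison between the text layer and the OCR output can miss duplicates when the two differ only in whitespace, line breaks, casing, or punctuation. The following is a minimal sketch of a more tolerant comparison based on token overlap. It is not part of Aspose.PDF: the class name OcrDuplicateCheck, its methods, and the 0.8 threshold are all hypothetical and would need tuning against your own documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper (not an Aspose API): compares OCR output with a page's text layer.
public static class OcrDuplicateCheck
{
    // Split text into lowercase alphanumeric tokens, dropping punctuation and whitespace.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        if (string.IsNullOrEmpty(text)) { return tokens; }

        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) { tokens.Add(current.ToString()); }
        return tokens;
    }

    // Fraction (0.0 to 1.0) of OCR tokens that also occur in the page's text layer.
    // Returns 0.0 when the OCR output contains no tokens.
    public static double Overlap(string pageText, string ocrText)
    {
        var ocrTokens = Tokenize(ocrText);
        if (ocrTokens.Count == 0) { return 0.0; }

        var pageTokens = new HashSet<string>(Tokenize(pageText));
        int shared = ocrTokens.Count(t => pageTokens.Contains(t));
        return (double)shared / ocrTokens.Count;
    }

    // Treat the OCR result as a duplicate when most of its tokens are already
    // present in the text layer. The 0.8 default is an arbitrary starting point.
    public static bool IsDuplicate(string pageText, string ocrText, double threshold = 0.8)
    {
        return Overlap(pageText, ocrText) >= threshold;
    }
}
```

In the loop above, the condition `!pageText.Contains(ocrText)` could then become `!OcrDuplicateCheck.IsDuplicate(pageText, ocrText)`. Empty OCR output yields an overlap of 0.0, so the caller would still need to skip empty results before adding them.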

By implementing these strategies, you should be able to minimize the duplication of content extracted from your PDFs. If the problem persists, consider reviewing the structure of the problematic PDFs to understand how the text and images are embedded.

For further reading on text extraction techniques and handling tagged PDFs, you may find the Aspose documentation helpful.

Sources:
[1]: PDF Text Extraction | Aspose.PDF Java PDF Processing API
[2]: Tagged PDF Content Extraction using Java | Aspose.PDF Java PDF …