How to find unrendered characters in a PDF?

Gerd · February 12, 2025, 8:10am

Hello Aspose,

In the screenshot above, you can see characters in the PDF that were not rendered.
Is there a way to find such holes, e.g. with the TextAbsorber?

Regards
Gerd

Professionalize.Discourse · February 12, 2025, 8:10am

@Gerd

To detect unrendered or “invisible” characters in a PDF with Aspose.PDF, you can use the TextFragmentAbsorber class. It lets you find text fragments that are marked as invisible and then handle them, for example by replacing them with empty strings.

Here’s a sample code snippet that demonstrates how to achieve this:

using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the PDF document (inputFile is the path to the source PDF)
var document = new Document(inputFile);

// Create a TextFragmentAbsorber to find text fragments
var textAbsorber = new TextFragmentAbsorber();

// Set options to prevent other text fragments from moving after hidden text replacement
textAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

// Accept the absorber to process the document pages
document.Pages.Accept(textAbsorber);

// Iterate through the found text fragments
foreach (var fragment in textAbsorber.TextFragments)
{
    // Check if the text fragment is invisible
    if (fragment.TextState.Invisible)
    {
        // Replace the invisible text with an empty string
        fragment.Text = "";
    }
}

// Save the modified document
document.Save(outputFile);

In this code:

You load the PDF document and create an instance of TextFragmentAbsorber.
TextReplaceOptions is set to ReplaceAdjustment.None so that the remaining text does not shift when invisible text is replaced.
After the absorber has processed the pages, you check each text fragment's TextState.Invisible flag. If it is set, you replace the fragment's text with an empty string or handle it as needed.
Finally, you save the modified document.

This method should help you identify and manage invisible characters in your PDF files effectively.
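
Since the question is about finding such characters rather than removing them, a read-only variant of the snippet above may be more useful. The following is only a minimal sketch: the class name InvisibleTextReport and the default path "input.pdf" are placeholders introduced for illustration, and it reports only fragments whose TextState.Invisible flag is set. Characters whose glyphs are simply missing from the font may not carry that flag, in which case they would not show up here.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class InvisibleTextReport
{
    static void Main(string[] args)
    {
        // Placeholder default path (an assumption); pass the real PDF path as the first argument.
        string inputFile = args.Length > 0 ? args[0] : "input.pdf";

        using (var document = new Document(inputFile))
        {
            // Collect all text fragments on all pages.
            var textAbsorber = new TextFragmentAbsorber();
            document.Pages.Accept(textAbsorber);

            int hits = 0;
            foreach (TextFragment fragment in textAbsorber.TextFragments)
            {
                // Skip everything that is not flagged as invisible.
                if (!fragment.TextState.Invisible)
                    continue;

                hits++;
                Rectangle box = fragment.Rectangle;

                // Report the page, the text, and the lower-left corner of the fragment.
                Console.WriteLine(
                    $"Page {fragment.Page.Number}: \"{fragment.Text}\" at ({box.LLX:F1}, {box.LLY:F1})");
            }

            // Nothing is saved, so the input file stays unchanged.
            Console.WriteLine($"{hits} invisible fragment(s) found; the document was not modified.");
        }
    }
}
```

Because the document is never saved, this can be run against the original file to see where flagged fragments are located before deciding whether to replace them.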

If you encounter issues where the code does not detect invisible text in certain files, it may be beneficial to share those specific files for further investigation.

Sources:
[1]: How to Detect “Invisible” Characters

Gerd · February 12, 2025, 8:29pm

I tested your code on my document. It found many empty text fragments, but none of them was an unrendered Unicode character. Please see my demo PDF.
Is there another way?

16ec3fd2-bd95-4653-bb93-3473dc055705.ko.pdf (2.3 MB)

asad.ali · February 12, 2025, 9:18pm

@Gerd

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-59288

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.