How to delete invisible objects in a PDF file with Aspose.PDF for Java

6.pdf (1.3 MB)
I have a lot of PDF files that were converted from PPT. Some of the text sits underneath pictures and is obscured by them. This text is useless, but when I extract the text content of a file, the redundant text is extracted too. I am therefore looking for a way to delete hidden elements in batches. The attached file is an example: the invisible text includes “Epoxidharz-Systeme”, which you can search for in the file.

@hilong

We searched for the above phrase after opening the PDF in Adobe Reader, but Adobe Reader did not return any results. Can you please share how you are checking that this text exists in the PDF? Also, please let us know whether all the PDFs have the same text hidden in them.

I’m sorry, I had not tested the search myself. Please try searching for “epoxidharz” instead; its location is shown in Figure 1 and Figure 2 (edit mode). Each PDF document may contain many hidden elements, and they are not the same from one document to the next, so I hope there is a way to delete these invisible elements in batch. I tried converting these PDF documents into HTML files and found that the hidden elements carry a hidden attribute there, but I would like to know whether they also have a hidden attribute in the PDF. You can also try editing the attachment 6.pdf: the hidden text is below the picture layer, and you can find it in edit mode. If you have any questions, please reply to me.

1.jpg (95.0 KB)
2.jpg (78.2 KB)

@hilong

We tried to find and remove the invisible/hidden text from your PDF using the code snippet below (C#, Aspose.PDF for .NET), but it did not succeed:

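using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the sample file.
Document pdfDocument = new Document(dataDir + @"6.pdf");

foreach (var page in pdfDocument.Pages)
{
    // Collect the text fragments of the current page.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // Only fragments drawn with the invisible rendering mode are targeted.
        if (fragment.TextState.RenderingMode == TextRenderingMode.Invisible)
        {
            Console.WriteLine(fragment.Text);
            fragment.Text = String.Empty;
        }
    }
}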

Therefore, an investigation ticket, PDFNET-50421, has been logged in our issue tracking system to analyze this case further. We will look into its details and keep you posted on the status of the ticket. Please allow us some time.

We are sorry for the inconvenience.

Thank you for your reply. I hope to hear from you soon about a solution to this problem.

@hilong

The TextFragment.TextState.Invisible property now correctly reports hidden text.
To remove the hidden text, you can use the following code snippet:

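using Aspose.Pdf;
using Aspose.Pdf.Text;

// inputFile and outputFile are the paths of the source and the cleaned PDF.
var document = new Document(inputFile);
var textAbsorber = new TextFragmentAbsorber();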

// This option can be used to prevent other text fragments from moving after hidden text replacement.
textAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

document.Pages.Accept(textAbsorber);

foreach (var fragment in textAbsorber.TextFragments)
{
    if (fragment.TextState.Invisible)
    {
        fragment.Text = "";
    }
}

document.Save(outputFile);
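
Since the original question was about processing many files in batch, the per-file removal above could be wrapped in a small folder driver. The following is only a rough sketch in plain Java that uses nothing but the JDK. The class name `BatchPdfCleaner` and the `PdfCleaner` hook are hypothetical names introduced for illustration and are not part of Aspose.PDF. The hook is where a port of the snippet above to Aspose.PDF for Java would be plugged in, and that port is not shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical batch driver: it only walks a folder and delegates the real work.
class BatchPdfCleaner {

    // Hypothetical hook: the per-file hidden-text removal is supplied by the caller.
    @FunctionalInterface
    interface PdfCleaner {
        void clean(Path input, Path output) throws Exception;
    }

    // Lists regular files directly inside dir whose name ends in ".pdf" (case-insensitive).
    static List<Path> listPdfs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Runs the cleaner on every PDF in inputDir and writes each result to
    // outputDir under the same file name. Returns the inputs that failed,
    // so one bad file does not stop the rest of the batch.
    static List<Path> cleanAll(Path inputDir, Path outputDir, PdfCleaner cleaner)
            throws IOException {
        Files.createDirectories(outputDir);
        List<Path> failed = new ArrayList<>();
        for (Path input : listPdfs(inputDir)) {
            Path output = outputDir.resolve(input.getFileName());
            try {
                cleaner.clean(input, output);
            } catch (Exception e) {
                failed.add(input);
            }
        }
        return failed;
    }
}
```

A call might look like `BatchPdfCleaner.cleanAll(Paths.get("in"), Paths.get("out"), myCleaner)`, where `myCleaner` is your own implementation of the hook. Writing to a separate output folder keeps the original files untouched, in case a removal turns out to be wrong.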