Remove/Delete text from PDF document using Aspose.PDF for .NET

We would like to remove text from a PDF. The following code doesn't work:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = "";
}

Dear Sir, I run this code on a button click, but it is not working. You can test it on any PDF file in which the text is selectable.

Hi Balram,


Can you please share the source PDF file causing this problem so that we can test the scenario at our end? We are sorry for this inconvenience.

P.S. The TextFragmentAbsorber constructor should be given the string or regular expression that you want to search for in the PDF document, for example TextFragmentAbsorber("Figure").

Hi,


Please find attached the PDF file from which the text layer should be removed.

Kind Regards
Balram Awasthi

Hi Balram,


To replace every TextFragment in the PDF document, you first need to search for each TextFragment using a regular expression and then replace it with an empty string. However, when using 2.pdf, I am afraid that not all TextFragments are being replaced. I have logged this problem as PDFNEWNET-36510 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

[C#]

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"c:/pdftest/2.pdf");
// search all TextFragments
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
pdfDocument.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = "";
}
pdfDocument.Save("c:/pdftest/TextRemoved.pdf");

The issues you have found earlier (filed as PDFNET-36510) have been fixed in Aspose.PDF for .NET 18.5. This message was posted using BugNotificationTool by asad.ali

@Balram

Adding to our previous responses, please check the following code snippet, which uses a much faster approach to remove text from a PDF document:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");

// Used text showing operators
Operator[] operators = new Operator[]
{
    new Operator.ShowText(),
    new Operator.SetGlyphsPositionShowText(new List<Operator.GlyphPosition>()),
    new Operator.MoveToNextLineShowText(),
    new Operator.SetSpacingMoveToNextLineShowText(0, 0, ""),
};

foreach (Page page in pdfDocument.Pages)
{
    ArrayList list = new ArrayList();
    OperatorCollection pageOperators = page.Contents;

    // Collect every text showing operator on the page
    foreach (Operator op in operators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(op);
        pageOperators.Accept(operatorSelector);
        list.AddRange(operatorSelector.Selected);
    }
    // Delete the collected operators from the page contents
    pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");

Dear Ali

In the latest release, v20.7.0.0, the API used in the above code snippet seems to have changed. Do you have an updated copy?

Best Regards
Jeff W

@weibanban

The main change in the latest version is that the operator classes are now in the Aspose.Pdf.Operators namespace. Please check the following updated code snippet:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");
// Used text showing operators
Operator[] operators = new Operator[]
{
    new Aspose.Pdf.Operators.ShowText(),
    new Aspose.Pdf.Operators.SetGlyphsPositionShowText(new List<Aspose.Pdf.Operators.GlyphPosition>()),
    new Aspose.Pdf.Operators.MoveToNextLineShowText(),
    new Aspose.Pdf.Operators.SetSpacingMoveToNextLineShowText(0, 0, ""),
};

foreach (Page page in pdfDocument.Pages)
{
    ArrayList list = new ArrayList();
    OperatorCollection pageOperators = page.Contents;

    foreach (Operator op in operators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(op);
        pageOperators.Accept(operatorSelector);
        list.AddRange(operatorSelector.Selected);
    }
    pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");

Dear Ali

Thanks very much!

One more question: is there any way to use a regular expression, or some other means, to remove only specific text rather than all text from a given PDF document?

Best Regards
Jeff W

@weibanban

Yes, it is possible. Please consider using the following code snippet:

var textFragmentAbsorber = new TextFragmentAbsorber(@"(?<TM>\w*ex\w*)");
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
Document pdfDocument = new Document(dataDir + "test.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber); // you can also loop through the pages and search page by page for lower memory consumption
var textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
 textFragment.Text = String.Empty;
}
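
The page-by-page variant mentioned in the comment above could look roughly like the sketch below. It only uses calls that already appear in this thread (TextFragmentAbsorber, TextSearchOptions, Page.Accept, Document.Save). The class name SpecificTextRemover, the method name RemoveMatchingText and its parameters are hypothetical names chosen for illustration, and the snippet has not been tested against any particular Aspose.PDF version.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class SpecificTextRemover
{
    // Hypothetical helper (not part of Aspose.PDF): blanks every text fragment
    // that matches the regular expression `pattern`, one page at a time.
    public static void RemoveMatchingText(string inputPath, string outputPath, string pattern)
    {
        Document pdfDocument = new Document(inputPath);

        foreach (Page page in pdfDocument.Pages)
        {
            // A fresh absorber per page, so only one page's fragments are held at a time
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(pattern);
            absorber.TextSearchOptions = new TextSearchOptions(true); // treat the pattern as a regex
            page.Accept(absorber);

            foreach (TextFragment fragment in absorber.TextFragments)
            {
                fragment.Text = string.Empty;
            }
        }

        // The replacements only reach the file once the document is saved
        pdfDocument.Save(outputPath);
    }
}
```

With this helper, a call such as SpecificTextRemover.RemoveMatchingText(dataDir + "test.pdf", dataDir + "test_out.pdf", @"\w*ex\w*") would target only words containing "ex"; the output file name here is likewise just an example.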

Hey, Ali,

I was trying to remove text from a PDF and I get an exception when deleting operators; see the exception in the attached image: image.png (3.9 KB)
The file that causes the exception:
orig_6.pdf (270.6 KB)

and the code:

private static readonly Operator[] TextOperators =
{
    new Aspose.Pdf.Operators.ShowText(),
    new Aspose.Pdf.Operators.SetGlyphsPositionShowText(new List<Operators.GlyphPosition>()),
    new Aspose.Pdf.Operators.MoveToNextLineShowText(),
    new Aspose.Pdf.Operators.SetSpacingMoveToNextLineShowText(0, 0, string.Empty)
};

private static void RemoveTextFromPdfPage(Page pdfDocumentPage)
{
    List<Operator> operators = new List<Operator>();
    OperatorCollection pageOperators = pdfDocumentPage.Contents;

    foreach (Operator @operator in TextOperators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(@operator);
        pageOperators.Accept(operatorSelector);
        operators.AddRange(operatorSelector.Selected);
    }

    pageOperators.Delete(operators);
}

Then I tried the method that sets text fragments to empty strings. Unfortunately, that method has its own bugs too: it leaves some selectable text behind. Try the source file I am uploading: orig_1.pdf (674.7 KB)

Use the file with this code:
 
private static void RemoveTextFromPdfPage(Page pdfDocumentPage)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
    textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
    pdfDocumentPage.Accept(textFragmentAbsorber);
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = string.Empty;
    }
}

The outcome is a file that still has selectable words in the table ("digits", "N/A"). The page number "1/7" is also not removed, although the left side of the page is cleared. There is a similar issue at the top of the page, where "en-English" is left selectable while the "NL-Netherlands" fragment to its left was replaced. See the output file I am getting: result_textFragments.pdf (675.4 KB)

Also, I have just checked the other pages of the same file, and the rest of them also have "en-English" at the top of the page and the page number/page count at the bottom left selectable.

@aurimas

We tested the scenario using Aspose.PDF for .NET 20.9 and were able to reproduce the exception in our environment. Therefore, we have logged it as PDFNET-48740 in our issue tracking system. We will look further into its details and keep you informed of the status of its correction. Please be patient and spare us some time.

The recommended approach to remove text from a PDF document is to use operators. We tried the same code snippet with your file and version 20.9, and we did not notice any issue: all text was removed successfully. Please check the attached PDF document for your reference. TextRemoved_operators_20.9.pdf (674.1 KB)

Ali,

Would it help to speed things up if someone posted this issue through paid support?

@aurimas

Yes, reporting the issue through paid support would escalate its priority, and it would take precedence over the issues logged under the free support model. If you have a paid support subscription, you can create a ticket there that references the issue ID so that it can be escalated accordingly.

Thanks, will do so!

Good day, dear Ali!
I used the code below to remove all text from several pages of a PDF document:
Operator[] operators = new Operator[]
{
    new ShowText(),
    new SetGlyphsPositionShowText(new List<GlyphPosition>()),
    new MoveToNextLineShowText(),
    new SetSpacingMoveToNextLineShowText(0, 0, ""),
};
List list = new List();

OperatorCollection pageOperators = page.Contents;
foreach (Operator op in operators)
{
    OperatorSelector operatorSelector = new OperatorSelector(op);
    pageOperators.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
}
pageOperators.Delete(list);

It works well for 13 of the 18 pages of the PDF document. The code is kept in a separate procedure. But after the 13th page, the following exception appears:

System.ArgumentException: An element with such a key already exists
(An entry with this key already exists - translation from my language)
   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.SortedList`2.Add(TKey key, TValue value)
   at Aspose.Pdf.OperatorCollection.#=zNaG7xQiZGBPB(IList`1 #=zmexNZBA=)
   at Aspose.Pdf.OperatorCollection.#=zZOBOkPM=(IList`1 #=zmexNZBA=, #=zK9S$bkg= #=zCJf6AvQ=)
   at Aspose.Pdf.OperatorCollection.Delete(IList`1 list)
   at Pdf2DwgA.SelectPDFs.RemoveTextFromPage(Page OldPage) in C:\C# AutoCAD\V2019\PDF2DWG2019\Pdf2DwgA\SelectPDFs.cs:line 251

SelectPDFs.cs is the name of my program. We will obtain a new version of Aspose.PDF for .NET very soon; will this bug be fixed in it?

I beg your pardon, but there was a mistake in the code I posted. Instead of
List list = new List();
it should read
List<Operator> list = new List<Operator>();

@Valdemarus

Could you kindly share the sample PDF with us as well? We will test the scenario in our environment and share our feedback with you accordingly.

The text in the document is in Russian, and that is the point: we need to remove it. HVAVdiagrams_RUS.pdf (888.0 KB)