Remove/Delete text from PDF document using Aspose.PDF for .NET

Balram · February 18, 2014, 11:51pm


We want to remove the text from a PDF. The following code doesn’t work as expected:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = "";
}

Dear sir, I wrote this code on a button click, but it is not working. You can test it on any PDF file in which text is selectable.

If your goal is to remove text from a PDF, clearing the Text property of the TextFragment objects only changes the document in memory. The document also has to be loaded from an actual file path, and the result has to be saved afterwards.

Here is an updated version of your code:

using Aspose.Pdf;
using Aspose.Pdf.Text;

public void RemoveTextFromPdf(string pdfPath, string outputPath)
{
    // Load the PDF document from the given path
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdfPath);

    // An absorber created without a search phrase collects all text fragments
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    pdfDocument.Pages.Accept(textFragmentAbsorber);

    // Get the text fragments
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

    // Replace the text of every fragment with an empty string
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = string.Empty;
    }

    // Save the result (outputPath is an assumed extra parameter)
    pdfDocument.Save(outputPath);
}

You can call this method with the path of your PDF file and the path for the output file. Ensure you have the necessary using directives and that the Aspose.PDF library is properly referenced in your project.

codewarior · February 19, 2014, 2:33pm

Hi Balram,

Can you please share the source PDF file that causes this problem so that we can test the scenario on our end? We are sorry for this inconvenience.

PS: the TextFragmentAbsorber constructor should be given the string or regular expression that you need to search for in the PDF document, for example TextFragmentAbsorber("Figure").

Balram · March 4, 2014, 12:05am

Hi,

Please find attached the PDF file from which the text layer should be removed.

Kind Regards

Balram Awasthi

codewarior · March 4, 2014, 11:22am

Hi Balram,

To replace every TextFragment in the PDF document, you first need to find each TextFragment using a regular expression and then replace its text with an empty string. However, with 2.pdf, I am afraid that not all TextFragments are replaced. We have logged this problem as PDFNEWNET-36510 in our issue tracking system. We will look further into the details and keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

[C#]

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"c:/pdftest/2.pdf");

// search all TextFragments
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
pdfDocument.Pages.Accept(textFragmentAbsorber);

TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = "";
}

pdfDocument.Save("c:/pdftest/TextRemoved.pdf");

aspose.notifier · May 13, 2018, 8:55pm

The issues you have found earlier (filed as PDFNET-36510) have been fixed in Aspose.PDF for .NET 18.5. This message was posted using BugNotificationTool by asad.ali

asad.ali · June 3, 2020, 9:04pm

@Balram

In addition to our previous responses, please check the following code snippet, which uses a much faster approach to remove text from a PDF document:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");

// Used text showing operators
Operator[] operators = new Operator[]
{
    new Operator.ShowText(),
    new Operator.SetGlyphsPositionShowText(new List<Operator.GlyphPosition>()),
    new Operator.MoveToNextLineShowText(),
    new Operator.SetSpacingMoveToNextLineShowText(0, 0, ""),
};

foreach (Page page in pdfDocument.Pages)
{
    // Collect every text showing operator on the page, then delete them together
    ArrayList list = new ArrayList();
    OperatorCollection pageOperators = page.Contents;

    foreach (Operator op in operators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(op);
        pageOperators.Accept(operatorSelector);
        list.AddRange(operatorSelector.Selected);
    }
    pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");

weibanban · August 6, 2020, 7:35am

Dear Ali

In the latest release, v20.7.0.0, the API used in the above code snippet seems to have changed. Do you have an updated copy?

Best Regards
Jeff W

asad.ali · August 6, 2020, 9:00pm

@weibanban

The main change in the latest version is that the operator classes now live in the Aspose.Pdf.Operators namespace. Please check the following updated code snippet:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");
// Used text showing operators
Operator[] operators = new Operator[]
{
  new Aspose.Pdf.Operators.ShowText(),
  new Aspose.Pdf.Operators.SetGlyphsPositionShowText(new List<Aspose.Pdf.Operators.GlyphPosition>()),
  new Aspose.Pdf.Operators.MoveToNextLineShowText(),
  new Aspose.Pdf.Operators.SetSpacingMoveToNextLineShowText(0,0,""),
};

foreach (Page page in pdfDocument.Pages)
{
 ArrayList list = new ArrayList();
 OperatorCollection pageOperators = page.Contents;

 foreach (Operator op in operators)
 {
  OperatorSelector operatorSelector = new OperatorSelector(op);
  pageOperators.Accept(operatorSelector);
  list.AddRange(operatorSelector.Selected);
 }
 pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");

weibanban · August 7, 2020, 4:57am

Dear Ali

Thanks very much!

One more question: is there any way to use a regular expression, or something else, to remove only specific text rather than all text from the given PDF document?

Best Regards
Jeff W

asad.ali · August 7, 2020, 3:57pm

@weibanban

Yes, it is possible. Please consider using following code snippet:

var textFragmentAbsorber = new TextFragmentAbsorber(@"(?<TM>\w*ex\w*)");
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
Document pdfDocument = new Document(dataDir + "test.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber); // you can also loop through pages to search text on page level for low memory consumption
var textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
 textFragment.Text = String.Empty;
}
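
To preview which words a pattern like the one above would select before touching a PDF, the pattern can be run against plain text first. The following is a minimal sketch that uses only System.Text.RegularExpressions. It assumes the absorber applies ordinary .NET regular-expression semantics; PatternPreview, Preview and the sample sentence are hypothetical and not part of Aspose.PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternPreview
{
    // Returns every substring of the input that the pattern matches, in order.
    // Hypothetical helper for trying out a search pattern on plain text.
    public static List<string> Preview(string pattern, string input)
    {
        var hits = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            hits.Add(match.Value);
        }
        return hits;
    }
}
```

For example, PatternPreview.Preview(@"(?<TM>\w*ex\w*)", "The next example explains plain text.") returns the four words "next", "example", "explains" and "text", that is, every word containing "ex".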

aurimas · September 7, 2020, 8:36am

Hey, Ali,

I was trying to remove text from a pdf and I get an exception on deleting operators, see the exception in the image attached. image.png (3.9 KB)
the file that causes the exception:
orig_6.pdf (270.6 KB)

and the code:

private static readonly Operator[] TextOperators =
{
    new Aspose.Pdf.Operators.ShowText(),
    new Aspose.Pdf.Operators.SetGlyphsPositionShowText(new List<Operators.GlyphPosition>()),
    new Aspose.Pdf.Operators.MoveToNextLineShowText(),
    new Aspose.Pdf.Operators.SetSpacingMoveToNextLineShowText(0, 0, string.Empty)
};

private static void RemoveTextFromPdfPage(Page pdfDocumentPage)
{
    List<Operator> operators = new List<Operator>();
    OperatorCollection pageOperators = pdfDocumentPage.Contents;
    foreach (Operator @operator in TextOperators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(@operator);
        pageOperators.Accept(operatorSelector);
        operators.AddRange(operatorSelector.Selected);
    }

    pageOperators.Delete(operators);
}

aurimas · September 7, 2020, 8:58am

Then I tried using the method that sets text fragments to empty strings. Unfortunately, that method has its own bugs too: it leaves some selectable text behind. Try the source file I am uploading: orig_1.pdf (674.7 KB)

Use the file with this code:
 
private static void RemoveTextFromPdfPage(Page pdfDocumentPage)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
    textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
    pdfDocumentPage.Accept(textFragmentAbsorber);
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = string.Empty;
    }
}

The outcome is a file that still has selectable words in the table: “digits” and “N/A”. The page numbers (“1/7”) are also not removed, although the left side of the page is cleared. There is a similar issue at the top of the page, where “en-English” is left selectable while the “NL-Netherlands” fragment to its left was replaced. See the output file I am receiving: result_textFragments.pdf (675.4 KB)

I have also just checked the other pages of the same file, and they too still have “en-English” at the top of the page and the page number at the bottom selectable.

asad.ali · September 7, 2020, 6:49pm

@aurimas

We tested the scenario using Aspose.PDF for .NET 20.9 and were able to reproduce the exception in our environment. Therefore, we have logged it as PDFNET-48740 in our issue tracking system. We will look further into its details and keep you informed about the status of its correction. Please be patient and spare us some time.

The recommended approach to remove text from a PDF document is to use operators. We tried the same code snippet with this file of yours and version 20.9, and we did not notice any issue: all text was removed successfully. Please check the attached PDF document for your reference. TextRemoved_operators_20.9.pdf (674.1 KB)

aurimas · September 8, 2020, 6:45am

Ali,

Would it help to speed things up if someone posted this issue through paid support?

asad.ali · September 8, 2020, 5:17pm

@aurimas

Yes, reporting the issue in paid support would escalate its priority and it will have precedence over the issues logged under free support model. If you have a paid support subscription, you can create a ticket there with the reference of issue ID so that it can be escalated accordingly.

aurimas · September 9, 2020, 6:45am

Thanks, will do so!

Valdemarus · October 6, 2020, 9:34am

Good day, dear Ali!
I used the code below to remove all text from several pages of a PDF document:
Operator[] operators = new Operator[]
{
    new ShowText(),
    new SetGlyphsPositionShowText(new List<GlyphPosition>()),
    new MoveToNextLineShowText(),
    new SetSpacingMoveToNextLineShowText(0, 0, ""),
};
List list = new List();

OperatorCollection pageOperators = page.Contents;
foreach (Operator op in operators)
{
    OperatorSelector operatorSelector = new OperatorSelector(op);
    pageOperators.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
}
pageOperators.Delete(list);

It works well for 13 of the 18 pages of the PDF document. The code is kept in a separate procedure. But after the 13th page, the following error appears:

System.ArgumentException: An element with such a key already exists
(An entry with this key already exists - translation from my language)
   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.SortedList`2.Add(TKey key, TValue value)
   at Aspose.Pdf.OperatorCollection.#=zNaG7xQiZGBPB(IList`1 #=zmexNZBA=)
   at Aspose.Pdf.OperatorCollection.#=zZOBOkPM=(IList`1 #=zmexNZBA=, #=zK9S$bkg= #=zCJf6AvQ=)
   at Aspose.Pdf.OperatorCollection.Delete(IList`1 list)
   at Pdf2DwgA.SelectPDFs.RemoveTextFromPage(Page OldPage) in C:\C# AutoCAD\V2019\PDF2DWG2019\Pdf2DwgA\SelectPDFs.cs:line 251

SelectPDFs.cs is the name of my program. We will obtain a new version of Aspose.PDF for .NET very soon. Will this bug be fixed in it?

Valdemarus · October 6, 2020, 9:47am

I beg your pardon, but there was a mistake in the code. It should be not
List list = new List();
but
List<Operator> list = new List<Operator>();
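
The stack trace above shows SortedList.Add throwing inside OperatorCollection.Delete, which is what SortedList does when the same key is added twice. One thing worth checking, as an assumption and not a confirmed diagnosis of this report, is whether the list passed to Delete contains the same operator more than once. The following minimal sketch removes repeated entries by key before the list is used. OperatorListHelper and DistinctByKey are hypothetical names, and the helper depends only on the .NET base class library.

```csharp
using System;
using System.Collections.Generic;

public static class OperatorListHelper
{
    // Returns the items in their original order, keeping only the first
    // item seen for each key. Hypothetical helper, not part of Aspose.PDF.
    public static List<T> DistinctByKey<T, TKey>(IEnumerable<T> items, Func<T, TKey> keySelector)
    {
        var seenKeys = new HashSet<TKey>();
        var result = new List<T>();
        foreach (T item in items)
        {
            // HashSet.Add returns false when the key was already present
            if (seenKeys.Add(keySelector(item)))
            {
                result.Add(item);
            }
        }
        return result;
    }
}
```

With the code from this thread it might be called as pageOperators.Delete(OperatorListHelper.DistinctByKey(list, op => op.Index)), assuming that list is a List<Operator> and that Operator.Index identifies an operator's position in the page contents. Whether this avoids the exception for the file in question has not been verified.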

asad.ali · October 6, 2020, 5:48pm

@Valdemarus

Could you kindly share the sample PDF with us as well? We will test the scenario in our environment and share our feedback with you accordingly.

Valdemarus · October 8, 2020, 10:12am

The text in the document is in Russian, and that is the task: we need to remove it. HVAVdiagrams_RUS.pdf (888.0 KB)