Search and redact text from PDF using Aspose.PDF for .NET - Redaction is not working correctly

We are using Aspose.PDF for .NET (version 19.11) to search for and redact credit card numbers (CCNs).

Problem: The redaction is incomplete for two valid CCNs that TextFragmentAbsorber found with the “(?:\d[ -]*){10,19}” regex.

The two CCNs:

  • 10011 212-847-4915
  • 35209 205-276-1807

The redacted parts:

  • 10011 2
  • 35209 2

What I found is that the TextFragment contains the whole CCN, but for some reason its segments contain only part of the number, and the URX and URY of the TextFragment rectangle are the same as those of the last TextSegment rectangle. So I cannot set the correct rectangle for the RedactionAnnotation.


  • TextFragment Text: 10011 212-847-4915
  • TextFragment Rectangle: 431.83,503.829999961853,455.577880077362,511.3540001297
  • First TextSegment Text: 10011
  • First TextSegment Rectangle: 431.83,503.829999961853,449.229900386482,511.3540001297
  • Second TextSegment Text: 2
  • Second TextSegment Rectangle: 452.11,503.829999961853,455.577880077362,511.3540001297

  • TextFragment Text: 35209 205-276-1807
  • TextFragment Rectangle: 431.83,395.799999961853,455.577880077362,403.3240001297
  • First TextSegment Text: 35209
  • First TextSegment Rectangle: 431.83,395.799999961853,449.229900386482,403.3240001297
  • Second TextSegment Text: 2
  • Second TextSegment Rectangle: 452.11,395.799999961853,455.577880077362,403.3240001297
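
The values above are consistent with the fragment rectangle being the bounding box of its segment rectangles, which would suggest the truncation already happens at the segment level. A minimal sketch to check this with the listed numbers; the Box type is a hypothetical stand-in for illustration, not the Aspose.Pdf.Rectangle class:

```csharp
using System;

// Hypothetical stand-in for a PDF rectangle (lower-left / upper-right corners).
public readonly struct Box
{
    public double LLX { get; }
    public double LLY { get; }
    public double URX { get; }
    public double URY { get; }

    public Box(double llx, double lly, double urx, double ury)
    {
        LLX = llx;
        LLY = lly;
        URX = urx;
        URY = ury;
    }

    // Smallest box that contains both a and b.
    public static Box Union(Box a, Box b)
    {
        return new Box(
            Math.Min(a.LLX, b.LLX),
            Math.Min(a.LLY, b.LLY),
            Math.Max(a.URX, b.URX),
            Math.Max(a.URY, b.URY));
    }
}
```

Calling Box.Union on the two segment rectangles listed for 10011 212-847-4915 gives 431.83,503.829999961853,455.577880077362,511.3540001297, which equals the reported TextFragment rectangle.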

Input file: sample-data.pdf (178.8 KB)

Output file: redacted-sample-data.pdf (228.3 KB)

Could you please help me with this problem?

Thanks,

Gabor

@erdeiga,

Can you please share your source code so that we can investigate further? Also, please try the latest version of Aspose.PDF (20.1) on your end before sharing the requested information with us.

@Adnan.Ahmad
I tried with the latest version (20.1) but it didn’t solve the problem.

using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

// sourcePath is the path of the input PDF (defined elsewhere).
Document tempDocument = new Document(sourcePath);

foreach (Page actualPage in tempDocument.Pages)
{
    // Search the current page for digit runs that look like card numbers.
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?:\d[ -]*){10,19}");
    textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
    actualPage.Accept(textFragmentAbsorber);

    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

    foreach (TextFragment textFragment in textFragmentCollection)
    {
        // Cover the fragment's rectangle, inset by 1.0 on the left and right.
        Rectangle rect = textFragment.Rectangle;
        RedactionAnnotation annot = new RedactionAnnotation(actualPage, new Rectangle(rect.LLX + 1.0, rect.LLY, rect.URX - 1.0, rect.URY))
        {
            FillColor = Aspose.Pdf.Color.Black,
        };

        // Add the annotation to the page, then apply the redaction.
        actualPage.Annotations.Add(annot);
        annot.Redact();
    }
}

When I changed the regex to “(\d){4,5}( -{3}){2} -{4}”, the TextFragment rectangle was different from the previous one. What is the reason for this?

  • TextFragment Text: 10011 212-847-4915
  • TextFragment Rectangle: 431.83,503.829999961853,490.954360866547,511.3540001297
  • First TextSegment Text: 10011
  • First TextSegment Rectangle: 431.83,503.829999961853,449.229900386482,511.3540001297
  • Second TextSegment Text: 212-847-4915
  • Second TextSegment Rectangle: 452.11,503.829999961853,490.954360866547,511.3540001297

@erdeiga,

Can you please share your desired result so that we can investigate further?

@Adnan.Ahmad

Sorry, I accidentally misspelled the second regex. What I wanted to write is (\d){4,5}([ -](\d){3}){2}[ -](\d){4}

So why can’t I redact the whole 10011 212-847-4915 and 35209 205-276-1807 matches when I search with the (?:\d[ -]*){10,19} regex, while with (\d){4,5}([ -](\d){3}){2}[ -](\d){4} I can?
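
As a cross-check outside Aspose.PDF, both patterns can be run against the two strings with the standard .NET Regex class. This is only a sketch: the class and method names (CcnRegexCheck, FirstMatch) are made up for illustration, and it says nothing about how TextFragmentAbsorber computes rectangles.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CcnRegexCheck
{
    // The two patterns quoted in this thread.
    public const string LoosePattern = @"(?:\d[ -]*){10,19}";
    public const string StrictPattern = @"(\d){4,5}([ -](\d){3}){2}[ -](\d){4}";

    // Returns the text of the first match of pattern in input,
    // or null when the pattern does not match at all.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }
}
```

With plain .NET regex matching, FirstMatch returns the complete string for both 10011 212-847-4915 and 35209 205-276-1807 under either pattern. That agrees with the TextFragment Text values reported above, so the difference between the two searches seems to lie in the segments and rectangles rather than in the matched text.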

@erdeiga,

We are looking into this and will get back to you with feedback soon.

Hi @Adnan.Ahmad,
Do you have any progress on this issue?

The black redaction box is meant to protect sensitive information from public view!

@erdeiga,

I have worked with the source file you shared and am attaching the result I generated. For further investigation, can you please share a comparison screenshot along with your desired result?

111outputsample_10.pdf (261.9 KB)