As we understand the scenario, you are trying to redact only the credit card number from your PDF page. To do that, would you please try the following code snippet in your environment with Aspose.PDF for .NET 19.2. For your reference, an output file generated in our environment is also attached.
// Load the source document.
Document pdf = new Document(dataDir + "Credit card (input).pdf");
// Match 19-character runs of digits and spaces, e.g. "3333 3333 3333 3333".
string regex = @"[0-9 ]{19}";
// Passing 'true' to TextSearchOptions treats the search phrase as a regular expression.
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex, new TextSearchOptions(true));
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.None;
pdf.Pages.Accept(textFragmentAbsorber);
// Cover every matched fragment with a black redaction annotation and apply it.
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    RedactionAnnotation ra = new RedactionAnnotation(textFragment.Page, textFragment.Rectangle);
    ra.FillColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Black);
    ra.BorderColor = Aspose.Pdf.Color.Black;
    textFragment.Page.Annotations.Add(ra);
    ra.Color = Aspose.Pdf.Color.Black;
    // Redact() removes the content under the annotation rectangle.
    ra.Redact();
}
pdf.Save(dataDir + "redacted.pdf");
If our assumptions differ from your original requirements, please share a few more details about what you want to achieve. We will test the scenario again in our environment and address it accordingly.
But the regular expression “[0-9 ]{19}” doesn’t suit us, because our credit card numbers can have from 13 to 19 digits, with different separators between the groups of digits.
Can something be done so that the replacement by regular expression ignores the spaces in between?
Would you please confirm whether the second column, i.e. cc_cvc, in your PDF documents contains only 3 digits. We will test the scenario accordingly and share our feedback with you.
Unfortunately, the neighboring columns can differ. Next to the credit card numbers there can be phone numbers (11 digits) or other data (SSN, ZIP code), on either the left or the right.
Please note that TextAbsorber extracts all text from a PDF page while retaining its formatting, whereas TextFragmentAbsorber extracts text the way it was added to the PDF (i.e. as multiple text fragments, where each text fragment can contain more than one text segment). However, please check the following regular expression(s):
[0-9-]{19}|[0-9 ]{19}|[0-9]{16} (It also includes empty spaces)
\b(\d{4})([ -]\s*)(\1\2\1\2\1)|\d{16}\b
Both of the above regular expressions match the following credit card number patterns:
3333 3333 3333 3333
3333-3333-3333-3333
3333333333333333
You can append other expected variations of number patterns to either of the above expressions (e.g. after the ‘|’ symbol) and use the result to redact the found text with the code snippet shared earlier. If you still face any issue, please feel free to let us know.
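Before plugging an expression into the absorber, it can help to check it against plain strings. The sketch below does that with System.Text.RegularExpressions for the two expressions quoted above. Note that the second one uses the backreferences \1 and \2, so it only matches when all four groups repeat the same four digits and the same separator, as in the 3333 samples. The class name CardPatternCheck and the FlexiblePattern constant are hypothetical and not part of this thread or of Aspose.PDF; FlexiblePattern is only an assumption about what "13 to 19 digits with optional separators" could look like, and it cannot tell a card number from an adjacent numeric column separated by a single space.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardPatternCheck
{
    // The two expressions quoted in the thread, copied verbatim.
    public const string CharacterClassPattern = @"[0-9-]{19}|[0-9 ]{19}|[0-9]{16}";
    public const string BackreferencePattern = @"\b(\d{4})([ -]\s*)(\1\2\1\2\1)|\d{16}\b";

    // Hypothetical pattern (an assumption, not from the thread):
    // 13 to 19 digits, each optionally preceded by one space or hyphen,
    // and not directly preceded or followed by another digit.
    public const string FlexiblePattern = @"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)";

    // Returns true when the pattern matches anywhere in the input string.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Prints one line per sample showing which of the three patterns match it.
    public static void Report(params string[] samples)
    {
        foreach (string sample in samples)
        {
            Console.WriteLine(
                "{0,-24} class={1} backref={2} flexible={3}",
                sample,
                Matches(CharacterClassPattern, sample),
                Matches(BackreferencePattern, sample),
                Matches(FlexiblePattern, sample));
        }
    }
}
```

Calling CardPatternCheck.Report("3333 3333 3333 3333", "1234 5678 9012 3456") prints one line per sample. Plain-string matching is only a first check: the absorber matches against the text as stored in the PDF, so results there may differ.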
Thank you for your proposed solution, but it does not fit, because Maestro credit cards can have the format #### #### #### #### ### (4-4-4-4-3), which collides with our case, where the next column holds the CVC code.
Do I understand correctly that there is no solution to this problem, because TextAbsorber and TextFragmentAbsorber extract text differently?
By the way, we tried various TextExtractionOptions: they affect extraction in TextAbsorber but have no effect in TextFragmentAbsorber.
Yes, TextAbsorber extracts text as a single string value, while TextFragmentAbsorber extracts it at the text fragment level.
It depends on where and how you are using these options. If you could please share a sample code snippet that you tried and with which you observed the issue, it would help us test the scenario in our environment and address it accordingly.
Please use the following regular expression in that case; it recognizes the previously mentioned number formats as well as the one above.