Problem replacing text matched by regular expressions

coderandom1 · February 25, 2019, 11:33am

We have a problem replacing text that is matched by a regular expression.

When extracting all the text from the Credit card (input) file, we get:

cc_number                                  cc_cvc 
5270 4267 6450 5516                                                              123 
5370 4638 8881 3020                                                              713 
4916 9766 5240 6147                                                              258 
5180 3807 3679 8221                                                              612 
4929 3813 3266 4295                                                              911

From the text it is clear that there are many spaces between the cc_number and cc_cvc columns.

But regular expression searches ignore them.

Listing of our replacement program:

Document pdfDocument = new Document(inputStream);
 
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
string extractedText = textAbsorber.Text;
 
string regex = @"\b(?:(?:\d{4}([ -]?)(?:\d{4}\1(?:\d{4}\1\d{4}(?:\1\d{3})?|\d{5})|\d{5}\1\d{6}|\d{6}\1\d{4}\d?|\d{7}\1\d{4}))|\d{6}[ -]?\d{13})\b";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex, new TextSearchOptions(true));
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.None;
pdfDocument.Pages.Accept(textFragmentAbsorber);
 
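foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    char space = ' ';
    // Remember the width of the matched text before it is replaced.
    double rectangleWidth = textFragment.Rectangle.Width;
    // Replace the match with a single space to measure the width of one space...
    textFragment.Text = space.ToString();
    // ...then fill the original width with spaces and paint the background black.
    textFragment.Text = new string(space, (int)(rectangleWidth / textFragment.Rectangle.Width) + 1);
    textFragment.TextState.BackgroundColor = Color.Black;
}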
 
pdfDocument.Save(outputStream);

How can I fix this?

The source and output files are attached, along with a file containing the extracted text.

Credit card (input).pdf (18.8 KB)
Credit card (output).pdf (19.0 KB)

asad.ali · February 25, 2019, 7:02pm

@coderandom1

Thanks for contacting support.

As per our understanding of the scenario, you are trying to redact only the credit card numbers from your PDF page. To do that, would you please try the following code snippet in your environment with Aspose.PDF for .NET 19.2? For your reference, an output file generated in our environment is also attached.

Document pdf = new Document(dataDir + "Credit card (input).pdf");
string regex = @"[0-9 ]{19}";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex, new TextSearchOptions(true));
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.None;
pdf.Pages.Accept(textFragmentAbsorber);

foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
 RedactionAnnotation ra = new RedactionAnnotation(textFragment.Page, textFragment.Rectangle);
 ra.FillColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Black);
 ra.BorderColor = Aspose.Pdf.Color.Black;
 textFragment.Page.Annotations.Add(ra);
 ra.Color = Aspose.Pdf.Color.Black;
 ra.Redact();
}

pdf.Save(dataDir + "redacted.pdf");

Redacted.pdf (21.4 KB)

If our assumptions differ from your actual requirements, please share more details about what you want to achieve. We will test the scenario again in our environment and address it accordingly.

coderandom1 · February 26, 2019, 9:17am

Thanks for your reply.

But the regular expression “[0-9 ]{19}” doesn’t suit us, because our credit card numbers can have from 13 to 19 digits, with different separators between the groups of digits.

Is there anything that can be done so that replacement by regular expression does not take the spaces between the columns into account?

asad.ali · February 26, 2019, 6:08pm

@coderandom1

Thanks for your feedback.

Would you please confirm whether the second column, i.e. cc_cvc, in your PDF documents contains only 3 digits? We will test the scenario accordingly and share our feedback with you.

coderandom1 · February 27, 2019, 9:23am

Unfortunately, the neighboring columns can vary. Phone numbers (11 digits) or other data (SSN, ZIP code) can appear next to the credit card numbers, on either the left or the right.

asad.ali · February 27, 2019, 6:15pm

@coderandom1

Thanks for providing the requested information.

coderandom1:

When extracting all the text from the Credit card (input) file, we get:
cc_number                                  cc_cvc 
5270 4267 6450 5516                                                              123 
5370 4638 8881 3020                                                              713 
4916 9766 5240 6147                                                              258 
5180 3807 3679 8221                                                              612 
4929 3813 3266 4295                                                              911  
From the text it is clear that there are many spaces between the cc_number and cc_cvc columns.

But regular expression searches ignore them.

Please note that TextAbsorber extracts all text from a PDF page while retaining its formatting, whereas TextFragmentAbsorber extracts text the way it was added to the PDF (i.e. as multiple text fragments, each of which can contain more than one text segment). However, please check the following regular expressions:

[0-9-]{19}|[0-9 ]{19}|[0-9]{16} (It also includes empty spaces)
\b(\d{4})([ -]\s*)(\1\2\1\2\1)|\d{16}\b

Both of the above regular expressions identify the following credit card number patterns:

3333 3333 3333 3333
3333-3333-3333-3333
3333333333333333

You can append other expected number patterns to either of the above expressions (e.g. after the ‘|’ symbol) and use the result to redact the found text with the code snippet shared earlier. If you still face any issue, please feel free to let us know.
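One detail of the second expression is worth checking before extending it: in .NET regular expressions, `\1` and `\2` are backreferences, so they match the same text that group 1 and group 2 captured rather than "another four digits". The sketch below uses only System.Text.RegularExpressions (no Aspose.PDF calls) to compare the posted expression with a variant that repeats `\d{4}` explicitly. The class name, the `Variant` pattern, and the sample strings are illustrative assumptions, not part of the original answer; note also that the variant drops the `\s*` after the separator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardPatternDemo
{
    // Second expression from the post above, unchanged.
    // \1 repeats the captured digits, \2 repeats the captured separator.
    public const string Posted = @"\b(\d{4})([ -]\s*)(\1\2\1\2\1)|\d{16}\b";

    // Hypothetical variant (an assumption, not from the thread): every group is an
    // independent \d{4}; only the separator (group 1) is back-referenced so that
    // the same separator is used throughout the number.
    public const string Variant = @"\b(?:\d{4}([ -])\d{4}\1\d{4}\1\d{4}|\d{16})\b";

    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        string[] samples =
        {
            "3333 3333 3333 3333",
            "3333-3333-3333-3333",
            "3333333333333333",
            "5270 4267 6450 5516" // first row of the extracted text
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(
                $"{sample}: posted={Matches(Posted, sample)}, variant={Matches(Variant, sample)}");
        }
    }
}
```

With these inputs the posted expression matches the three `3333…` samples but not `5270 4267 6450 5516`, because the four groups of digits differ; the variant matches all four.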

coderandom1 · February 28, 2019, 9:11am

Thank you for the proposed solution, but it does not fit, because Maestro cards can have the format #### #### #### #### ### (4-4-4-4-3), which collides with our situation where the CVC code is in the next column.

Do I understand correctly that there is no solution to this problem, because TextAbsorber and TextFragmentAbsorber extract text differently?

By the way, we tried applying various TextExtractionOptions; they affect extraction in TextAbsorber but have no effect in TextFragmentAbsorber.
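The collision can be reproduced outside Aspose.PDF with System.Text.RegularExpressions alone. The sketch below runs the search expression from the first post against two hand-built lines: one with a wide gap between the columns, as in the TextAbsorber output, and one with a single space. The single-space line is only an assumption about what the fragment-level search might see; the thread does not confirm the exact text that TextFragmentAbsorber matches against. The class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColumnOverlapDemo
{
    // Search expression from the first post, unchanged.
    public const string CardRegex =
        @"\b(?:(?:\d{4}([ -]?)(?:\d{4}\1(?:\d{4}\1\d{4}(?:\1\d{3})?|\d{5})|\d{5}\1\d{6}|\d{6}\1\d{4}\d?|\d{7}\1\d{4}))|\d{6}[ -]?\d{13})\b";

    // Returns the text of the first match, or an empty string if nothing matches.
    public static string FirstMatch(string line) =>
        Regex.Match(line, CardRegex).Value;

    public static void Main()
    {
        // Wide gap between the columns, as in the TextAbsorber output.
        string wideGap = "5270 4267 6450 5516" + new string(' ', 60) + "123";

        // ASSUMPTION: the gap is reduced to one space in the text being searched.
        string narrowGap = "5270 4267 6450 5516 123";

        Console.WriteLine($"wide gap:   [{FirstMatch(wideGap)}]");
        Console.WriteLine($"narrow gap: [{FirstMatch(narrowGap)}]");
    }
}
```

With the wide gap the match stops after the 16-digit number. With a single space the optional `(?:\1\d{3})?` part, which is there for the 4-4-4-4-3 format, also consumes the CVC, so the match becomes `5270 4267 6450 5516 123`.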

asad.ali · February 28, 2019, 5:02pm

@coderandom1

Thanks for getting back to us.

Yes, TextAbsorber extracts text as a single string value, while TextFragmentAbsorber extracts it at the text fragment level.

Whether these options take effect depends on where and how you are using them. If you could share a sample code snippet that you have tried and that shows the issue, it would help us test the scenario in our environment and address it accordingly.

Meanwhile, please use the following regular expression, which recognizes the previously mentioned number formats as well as the 4-4-4-4-3 format above.

\b(\d{4})([ -]\s*)(\1\2\1\2\1(\2\d{3})?)|\d{16}\b