Searching using regex, differences between TextFragmentAbsorber and text from TextAbsorber

There seem to be differences between the text that a TextFragmentAbsorber searches and the text that a TextAbsorber extracts.

The TextFragmentAbsorber version seems to have the white space consolidated.

The TextFormattingMode doesn’t seem to help with this issue.

In the text extracted from the pdf in the pack, there are several signature boxes which can be identified using the simple regex Signature\s{4,}Date. The TextFragmentAbsorber doesn’t find this.

I have attached a recreation pack that includes the .cs file and an example of a .pdf file that demonstrates the issue.

The expected result from the recreation code is that both searches report the same number of matches.

This code was written using Aspose.Pdf version 23.5.0.

Recreator Code.zip (383.7 KB)

Any help or suggestions to help resolve this issue would be greatly appreciated.

Darren

@wraydc2

We need to investigate this case in detail. For that purpose, an investigation ticket has been logged as PDFNET-54681 in our issue tracking system. We will look into it further and keep you posted on the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

No problem, thank you.

This also doesn’t seem to be a one-off; there are many other documents that exhibit the same issue.

HTH,

Darren

@wraydc2

Please share the other documents with us as well, along with sample regexes, so that we can include them in our investigation.

Yes, I will find some more. Have you not been able to recreate the issue with the PDF provided?

@DarrenWray

Sometimes an issue is related to a specific type of document and is resolved only for that document. We always encourage users to share all problematic files so that we can include them in our investigation and run multiple tests to ensure that the issue does not occur with any of your files.

I think this issue may be related to another issue that I have identified: PDFNET-54761. That one doesn’t involve regex, but it does involve finding text extracted from the PDF.

@DarrenWray

Yes, we have already linked these tickets internally in our issue management system so that they can both be investigated from the same perspective.

Any news on this issue?

@DarrenWray

We are afraid that the earlier logged tickets have not yet been resolved due to other pending issues in the queue. However, we will inform you as soon as we make some progress towards their resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

Any progress on this issue?

@DarrenWray

We regret to inform you that the earlier logged tickets have not yet been resolved due to other pending issues in the queue. As soon as we complete the investigation of these tickets, we will be able to share an update on the fix ETA with you. We highly appreciate your patience in this regard, and we apologize for the inconvenience.

@DarrenWray

We checked the code you provided and can confirm that using the regex:

strSearchPattern = @"Signature(\s{4,})Date";

will not find the signature boxes. This is because the Flatten mode for TextFragmentAbsorber ignores extra white spaces.
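To illustrate the effect, here is a minimal C# sketch that uses only System.Text.RegularExpressions. CollapseWhitespace is a hypothetical stand-in for whitespace consolidation in general; it is an assumption for illustration, not the actual Aspose.PDF implementation, and the class and method names are made up.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Hypothetical stand-in for whitespace consolidation:
    // every run of white space becomes a single space.
    // This is NOT how Aspose.PDF is known to implement it.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ");
    }

    // Counts the non-overlapping matches of a pattern in the given text.
    public static int CountMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }
}
```

For a made-up sample such as "Signature" followed by ten spaces and "Date", CountMatches with the pattern Signature(\s{4,})Date finds one match in the original string and none after CollapseWhitespace, because only a single space is left between the two words.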

The solution that can help in this case is to modify the regex as follows:

strSearchPattern = @"Signature(\s{1,})Date";

With this change, the output of the code you provided looks like this:

PerformTextFragmentRegexSearch - Matches Found: 5
1 Signature Date
2 Signature Date
3 Signature Date
4 Signature Date
5 Signature Date

PerformExtractedRegexSearch - Matches Found: 5
1 Signature Date
2 Signature Date
3 Signature Date
4 Signature Date
5 Signature Date

We hope this will help to solve your issue.
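As a rough sketch of the trade-off, the snippet below compares the strict and relaxed patterns on plain strings. It uses only System.Text.RegularExpressions; the class name and constants are hypothetical and are not part of the recreation pack. Note that \s+ is equivalent to \s{1,}.

```csharp
using System.Text.RegularExpressions;

public static class SignaturePatterns
{
    // Original pattern: requires at least four white-space characters.
    public const string Strict = @"Signature(\s{4,})Date";

    // Relaxed pattern: one or more white-space characters (same as \s{1,}).
    public const string Relaxed = @"Signature(\s+)Date";

    // Counts the non-overlapping matches of a pattern in the given text.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }
}
```

The relaxed pattern matches text whether or not the spacing has been consolidated, so both searches can return the same count. The cost is that it also matches the words "Signature Date" in ordinary running text, which the strict pattern would have skipped, so it may be worth checking the matches on documents where that phrase can appear outside a signature box.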

The issues you have found earlier (filed as PDFNET-54761) have been fixed in Aspose.PDF for .NET 23.9.