Non-breaking space issues on search

Pinho · November 11, 2019, 6:27pm

Hi,

Currently, I need to search a PDF file for specific strings and I have one case in which I can’t match the string.

The case is the string “Unaudited Capital Account Statement”. In the PDF, the white space between the words of this string is not a regular space but a non-breaking space (Unicode \u00A0). Because of this, my search string does not match the PDF content and the search fails.

One possible solution for this case would be to replace every occurrence of the non-breaking space with a regular space (" ").
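For reference, the replacement approach described above can be sketched in a few lines. This is only an illustration in Python on an already extracted string, not Aspose.PDF code; the names `NBSP`, `normalize_spaces`, `extracted` and `target` are made up for the example.

```python
# Hypothetical illustration: the text is assumed to be already extracted
# from the PDF into a plain string.
NBSP = "\u00a0"  # Unicode NO-BREAK SPACE


def normalize_spaces(text: str) -> str:
    """Return text with every non-breaking space replaced by a regular space."""
    return text.replace(NBSP, " ")


extracted = "Unaudited\u00a0Capital\u00a0Account\u00a0Statement"
target = "Unaudited Capital Account Statement"

# A plain substring search fails on the raw text and succeeds after normalizing.
found_raw = target in extracted
found_normalized = target in normalize_spaces(extracted)
```

The drawback is that the extracted text has to be copied and rewritten before every search, which is why a pattern-based search may be cleaner.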

Is there any cleaner solution for this problem?

Thank you in advance,

Best Regards,
Daniel

Farhan.Raza · November 12, 2019, 9:09am

@Pinho

Thank you for contacting support.

To search for the phrase “Unaudited Capital Account Statement”, please try the pattern “Unaudited\sCapital\sAccount\sStatement”, in which \s matches any white space character between the words.
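As a rough illustration of why this pattern helps, the sketch below uses Python's `re` module rather than the .NET regex engine that Aspose.PDF for .NET relies on. In Python 3 string patterns, `\s` matches Unicode white space, including U+00A0. The helper `phrase_to_pattern` is a hypothetical name introduced here, not part of any library.

```python
import re


def phrase_to_pattern(phrase: str) -> str:
    """Build a pattern that allows any single white space character between words."""
    # re.escape guards against regex metacharacters inside the words themselves.
    return r"\s".join(re.escape(word) for word in phrase.split())


pattern = re.compile(phrase_to_pattern("Unaudited Capital Account Statement"))

with_nbsp = "Unaudited\u00a0Capital\u00a0Account\u00a0Statement"
with_space = "Unaudited Capital Account Statement"

# Both variants match, because \s covers the regular and the non-breaking space.
matches_nbsp = pattern.search(with_nbsp) is not None
matches_space = pattern.search(with_space) is not None
```

Joining the words with `\s+` instead of `\s` would also tolerate runs of several white space characters, at the cost of a slightly looser match.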

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

Pinho · November 13, 2019, 8:20pm

Thank you! That worked!

Meanwhile, I have run into another case that seems to be more complicated.

I need to find the string “Schedule of Partner’s Capital Account”, but the text fragment I get is:

Schedule of Partner
’
s Capital Account

or basically “Schedule of Partner\r\n’\r\ns Capital Account”.

Is there any way I could match this with a regex without getting false positives?
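One possible direction, not confirmed anywhere in this thread, is a pattern that requires the apostrophe but allows optional white space (including line breaks) around it. The sketch below is plain Python on an already extracted string; `build_pattern` is a hypothetical helper, and whether the Aspose.PDF search passes the line breaks through to the regex engine is an assumption.

```python
import re

# Character class accepting the straight and the typographic apostrophe.
APOSTROPHE_CLASS = "['\u2019]"


def build_pattern(phrase: str) -> re.Pattern:
    """Compile a pattern tolerant of white space between words and around apostrophes."""
    words = []
    for word in phrase.split():
        # Split each word at its apostrophes and rejoin the pieces with a
        # mandatory apostrophe surrounded by optional white space.
        pieces = re.split(APOSTROPHE_CLASS, word)
        separator = r"\s*" + APOSTROPHE_CLASS + r"\s*"
        words.append(separator.join(re.escape(piece) for piece in pieces))
    # Words themselves must still be separated by at least one white space character.
    return re.compile(r"\s+".join(words))


pattern = build_pattern("Schedule of Partner\u2019s Capital Account")

fragment = "Schedule of Partner\r\n\u2019\r\ns Capital Account"
matches_fragment = pattern.search(fragment) is not None

# The apostrophe stays mandatory, so similar phrases without it are rejected.
matches_without_apostrophe = pattern.search("Schedule of Partners Capital Account") is not None
```

Keeping the apostrophe and the word boundaries mandatory is what limits false positives here; only the white space around the apostrophe is made optional.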

Thank you,
Daniel

Farhan.Raza · November 13, 2019, 11:34pm

@Pinho

Thank you for your kind feedback.

Please share your sample PDF document so that we may investigate this scenario and help you out. Before getting back to us, please make sure you are using Aspose.PDF for .NET 19.11.