End-of-line matching in PDF files is inconsistent across platforms

jzsetthg · September 17, 2021, 2:23pm

Hi,

I’ve found an issue when processing PDF files on different platforms.
The problem occurs when a “TextFragmentAbsorber” instance is matched via “Accept” on a document’s page object.
The issue itself: when matching an end of line without any additional characters from the next line, Windows and Linux behave differently.

The text we are trying to match: 7\r\n

Windows can match that, Linux cannot.

If we include a character from the next line: 7\r\n6

Both Windows and Linux work.

I also noticed that if I remove the \n character: 7\r

Both Windows and Linux work.

I have also included a minimal test project that prints the text matched by each regex as Unicode code points, which shows the difference when it is run on different platforms.

Is there anything I can do to avoid this behavior, or is this an issue that should be fixed on your side?
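To show what "writing out matches as Unicode characters" means, here is a minimal sketch in Python. It is not the attached project and does not use Aspose.PDF; it only uses Python's standard `re` module, and the helper name `dump_matches` and the sample string are assumptions made for illustration.

```python
import re


def dump_matches(pattern: str, text: str) -> list[str]:
    """Render every regex match as space-separated U+XXXX code points."""
    return [
        " ".join(f"U+{ord(ch):04X}" for ch in match.group(0))
        for match in re.finditer(pattern, text)
    ]


# Hypothetical extracted text: "7" and "6" on two lines separated by CR LF.
sample = "7\r\n6"

# The three patterns discussed in this post.
for pattern in (r"7\r\n", r"7\r\n6", r"7\r"):
    print(pattern, "->", dump_matches(pattern, sample))
```

Printing code points instead of raw text makes invisible characters such as CR (U+000D) and LF (U+000A) visible, so the output from two platforms can be compared line by line.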

Thanks in advance

PdfBreakLine.zip (12.7 KB)

asad.ali · September 19, 2021, 9:17pm

@jzsetthg

Would you please share a bit more about how you expect the API to behave? Some regular expressions work on both Windows and Linux. Do you want to use a single regex, i.e. 7\r\n, in both environments? Please share the requested information so that we can investigate the scenario accordingly and share our feedback with you.

jzsetthg · September 20, 2021, 6:56am

@asad.ali

Thank you for your fast response. Yes, I need to use one regex for both environments. I think that if the 7\r\n regex works on Windows, it should also work on Linux, since the file content does not change just because it is processed on another operating system. Also, if 7\r\n6 works, 7\r\n should work as well, in my opinion.
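As an interim idea, since 7\r reportedly matches on both platforms, a pattern that treats the trailing \n as optional might serve as a single regex for both. The sketch below only demonstrates the regex itself with Python's standard `re` module; whether “TextFragmentAbsorber” handles such a pattern the same way on Windows and Linux is an untested assumption, and the name `EOL_TOLERANT` is hypothetical.

```python
import re

# Hypothetical pattern: "7" followed by CR LF, a lone CR, or a lone LF.
EOL_TOLERANT = re.compile(r"7(?:\r\n|\r|\n)")

# Sample strings standing in for text extracted with different line endings.
samples = {
    "CR LF": "7\r\n6",
    "CR only": "7\r6",
    "LF only": "7\n6",
}

for label, text in samples.items():
    match = EOL_TOLERANT.search(text)
    # repr() makes the matched control characters visible.
    print(label, "->", repr(match.group(0)) if match else None)
```

The alternation lists \r\n first so that a full CR LF pair is consumed when it is present, rather than stopping after the CR.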

asad.ali · September 20, 2021, 6:39pm

@jzsetthg

We have logged an investigation ticket as PDFNET-50609 in our issue management system to analyze this case further. We will look into the details and let you know once the ticket is resolved. Please be patient and allow us some time.

We are sorry for the inconvenience.