Incorrect Regex Splitting Into TextFragments

instaknow · August 16, 2023, 3:02pm

I am supplying the regex “\S+” to a TextFragmentAbsorber in order to split a PDF page into individual words. This has worked well on many, many documents, but I have hit a snag with a recent one (see attached PDF below).

Near the bottom of the page there are two lines that start with “4/15/22”; near the end of each is the text “Bradenton” followed by “FL”. In every other line the words are separated cleanly into two pieces. In these two lines, however, the TextFragments returned are “BradentonF” and “L”, as though the space fell between the F and the L. (See the attached image file below.)
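To illustrate the symptom outside of Aspose, here is a minimal Python sketch of what the pattern does. The two sample strings are made up for illustration (they are not extracted from the attached PDF), and `split_words` is a hypothetical helper, not an Aspose call. It only shows that “\S+” splits wherever the whitespace sits in the text it is given, which would suggest the misplaced space comes from the extracted text rather than from the regex itself.

```python
import re


def split_words(line: str) -> list[str]:
    """Return every run of non-whitespace characters, as the pattern \\S+ would."""
    return re.findall(r"\S+", line)


# Hypothetical sample text: what the line should look like...
expected_line = "4/15/22 Bradenton FL"
# ...and what it appears to look like if the space lands after the F.
observed_line = "4/15/22 BradentonF L"

print(split_words(expected_line))  # ['4/15/22', 'Bradenton', 'FL']
print(split_words(observed_line))  # ['4/15/22', 'BradentonF', 'L']
```

The same pattern yields the correct words for the first string and the “BradentonF” / “L” split for the second, matching what is seen on the two affected lines.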

When I copy and paste the text of the PDF from a reader application into a text editor, the spacing is correct, but so far I have been unable to get Aspose to return the correct results.

Could you please take a look and let me know if I am doing something incorrectly?

Thank you!

Boxed Problem.png (104.7 KB)
MDH - Page 45 Only.pdf (121.5 KB)

asad.ali · August 16, 2023, 10:27pm

@instaknow

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55299

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.