I am supplying the regex “\S+” to a TextFragmentAbsorber in order to split a PDF page into individual words. This has worked well on many, many documents, but I have hit a snag with a recent one (see attached PDF below).
Near the bottom of the page there are two lines that start with “4/15/22”; near the end of each is the text “Bradenton” followed by “FL”. In every other line the words are separated cleanly into two pieces. In these two lines, however, the TextFragments returned are “BradentonF” and “L”, as though the space fell between the F and the L instead. (See attached image file below.)
When I copy and paste the text of the PDF from a reader application into a text editor, the spacing comes out correctly, but so far I have been unable to get Aspose to return the same result.
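To show that the regex itself is not the issue, here is a minimal sketch in plain Python (standard `re` module only, not the Aspose API) of how “\S+” splits a line. The two sample strings are abbreviated, hypothetical stand-ins for one of the affected lines, not the actual text of the PDF, and `split_words` is just an illustrative helper name.

```python
import re

# The same pattern I pass to the TextFragmentAbsorber:
# one or more consecutive non-whitespace characters.
WORD_PATTERN = re.compile(r"\S+")


def split_words(line):
    """Return every run of non-whitespace characters in the line, in order."""
    return WORD_PATTERN.findall(line)


# Hypothetical, abbreviated stand-ins for one affected line.
expected_line = "4/15/22 Bradenton FL"   # spacing as it appears when copy-pasted
observed_line = "4/15/22 BradentonF L"   # spacing implied by the fragments I get back

expected_words = split_words(expected_line)
observed_words = split_words(observed_line)

print(expected_words)  # ['4/15/22', 'Bradenton', 'FL']
print(observed_words)  # ['4/15/22', 'BradentonF', 'L']
```

Since the pattern only splits on whitespace, the fragments I am getting suggest the space is being placed one character too late in the extracted text, before the regex is ever applied.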
Could you please take a look and let me know if I am doing something incorrectly?
Thank you!
Boxed Problem.png (104.7 KB)
MDH - Page 45 Only.pdf (121.5 KB)