Incorrectly Splitting into TextFragments

instaknow · October 4, 2023, 4:49pm

I am supplying the regex "\S+" to a TextFragmentAbsorber in order to split a PDF page into individual words. This works well for the vast majority of documents, but a recent one has an issue. I have attached the PDF to this post.

In the middle header of the second chart on the page, there's a section that says "New / Fixed / Balance" (see the attached image file below, with the red box). When the text is copied and pasted out of the PDF manually, the word "Balance" is separated from the preceding slash by a space. However, when I apply our usual regex to the file and split it into TextFragments, I get "/Balance" instead.

Could you please take a look and let me know if I am doing something incorrectly?

Thank you!

One Page Redacted.pdf (655.9 KB)
Boxed Problem.png (109.5 KB)

asad.ali · October 4, 2023, 8:17pm

@instaknow

The document structure looks complex and we need to investigate it. Can you please also share the sample code snippet that you are using? We will log an investigation ticket and share the ID with you.

instaknow · October 5, 2023, 3:53pm

Thank you. Here is a modified code snippet. I cannot post the full code here, but I believe this should be enough to reproduce the issue (I apologize if there are minor issues; I had to pull it out of a much larger, otherwise working piece of code):

Imports System.Data 'needed for DataTable and DataRow
Imports Aspose
Imports Aspose.Pdf
Imports Aspose.Pdf.Page
Imports Aspose.Pdf.Text
Imports Aspose.Pdf.Text.TextOptions
Imports Aspose.Pdf.Document


Module Test_Code_for_Aspose

    Private Sub Get_Text_Details_Using_Aspose(ByRef extractedTextGrid As DataTable)

        Dim errMsg As String = ""

        Try
            '>>> Setup Aspose
            'license setup goes here normally

            '>>> Open document
            Dim pdfFilepath As String = ""
            Dim pdfDocument As New Aspose.Pdf.Document(pdfFilepath)
            Dim pdfTotalPages As Integer = pdfDocument.Pages.Count
            Dim pageNumber As Integer

            '>>> Find word information for each page
            Dim extractedText As String
            Dim extractedTextRow As DataRow
            For pageNumber = 1 To pdfTotalPages
                Dim textAbsorberPage As New TextAbsorber()
                Dim textFragmentAbsorber As New TextFragmentAbsorber("\b\S+")
                Dim textSearchOptions As New TextSearchOptions(True)
                textFragmentAbsorber.TextSearchOptions = textSearchOptions
                pdfDocument.Pages(pageNumber).Accept(textAbsorberPage)
                extractedText = textAbsorberPage.Text
                pdfDocument.Pages(pageNumber).Accept(textFragmentAbsorber)
                Dim textFragmentCollection As TextFragmentCollection = textFragmentAbsorber.TextFragments

                'loop through the fragments
                For Each textFragment As TextFragment In textFragmentCollection
                    'fill the new row first, then add it to the grid
                    extractedTextRow = extractedTextGrid.NewRow()
                    extractedTextRow("Word") = textFragment.Text
                    extractedTextGrid.Rows.Add(extractedTextRow)
                Next textFragment

                textFragmentAbsorber = Nothing
                textAbsorberPage = Nothing
                GC.Collect()
            Next

        Catch ex As Exception
            errMsg = errMsg & " " & ex.Message.ToString
        End Try
    End Sub


End Module
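
For reference, the first post quotes the pattern as "\S+", while the snippet above passes "\b\S+". The sketch below compares the two on a plain string using only System.Text.RegularExpressions, with no Aspose calls. The module name, the Split_Words helper, and the two sample strings are hypothetical and exist only for illustration; the "joined" string is an assumption about what the extracted page text might look like, not something confirmed from the PDF.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module Regex_Check

    'Hypothetical helper: returns every match of pattern in input, in order.
    Function Split_Words(input As String, pattern As String) As String()
        Return Regex.Matches(input, pattern).
            Cast(Of Match)().
            Select(Function(m) m.Value).
            ToArray()
    End Function

    Sub Main()
        'Header text as it reads when copied manually (space after the slash)
        Dim spaced As String = "New / Fixed / Balance"
        'Assumed variant with no space between the slash and "Balance"
        Dim joined As String = "New / Fixed /Balance"

        For Each pattern As String In {"\S+", "\b\S+"}
            Console.WriteLine(pattern & " on spaced: " & String.Join(" | ", Split_Words(spaced, pattern)))
            Console.WriteLine(pattern & " on joined: " & String.Join(" | ", Split_Words(joined, pattern)))
        Next
    End Sub

End Module
```

In plain .NET, only "\S+" applied to the joined string yields "/Balance"; "\b\S+" yields "Balance" from both strings, because a word boundary falls between the slash and the letter B. Whether TextFragmentAbsorber evaluates the pattern the same way is not confirmed in this thread.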

asad.ali · October 5, 2023, 7:26pm

@instaknow

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55635

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.