I am supplying the regex “\S+” to a TextFragmentAbsorber to split a PDF page into individual words. This works well for the vast majority of documents, but a recent one has an issue. I have attached the PDF to this post.
In the middle header of the second chart on the page, there is a section that reads “New / Fixed / Balance” (see the attached image below, marked with a red box). When the text is copied and pasted out of the PDF manually, the word “Balance” is separated from the preceding slash by a space. However, when I apply our usual regex to the file and split it into text fragments, I get “/Balance” instead.
Could you please take a look and let me know if I am doing something incorrectly?
Thank you!
One Page Redacted.pdf (655.9 KB)
Boxed Problem.png (109.5 KB)
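For reference, here is a minimal sketch (not part of the original post) of how the pattern “\S+” tokenizes the header under the two spacings described above, using only the standard .NET Regex class. The module name, the helper Split_Words, and the second input string are assumptions made for illustration; the second string is only a guess at what the text might look like without the space. Since “\S+” matches runs of non-whitespace characters, a “/Balance” token would suggest that the text being searched has no whitespace between the slash and “Balance”, although only the investigation of the PDF itself can confirm that.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module Whitespace_Split_Demo
    ' Hypothetical helper: returns every run of non-whitespace characters in the input.
    Function Split_Words(ByVal input As String) As String()
        Return Regex.Matches(input, "\S+").Cast(Of Match)().Select(Function(m) m.Value).ToArray()
    End Function

    Sub Main()
        ' Spacing as it appears when the header is copied out of the PDF by hand.
        Console.WriteLine(String.Join("|", Split_Words("New / Fixed / Balance")))
        ' Prints: New|/|Fixed|/|Balance

        ' Assumed spacing if the searched text had no space after the second slash.
        Console.WriteLine(String.Join("|", Split_Words("New / Fixed /Balance")))
        ' Prints: New|/|Fixed|/Balance
    End Sub
End Module
```

Comparing the tokens from this kind of check against the page text returned by a plain TextAbsorber may help narrow down whether the missing space is in the extracted text or introduced during the fragment search.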
@instaknow
The document structure looks complex, and we need to investigate it. Could you please also share the sample code snippet that you are using? We will log an investigation ticket and share the ID with you.
Thank you. Here is a modified code snippet. I cannot post the full code here, but I believe this should be enough to reproduce the issue (I apologize if there are minor errors; I had to pull it out of a much larger, otherwise working piece of code):
Imports System.Data
Imports Aspose
Imports Aspose.Pdf
Imports Aspose.Pdf.Page
Imports Aspose.Pdf.Text
Imports Aspose.Pdf.Text.TextOptions
Imports Aspose.Pdf.Document

Module Test_Code_for_Aspose
    Private Sub Get_Text_Details_Using_Aspose(ByRef extractedTextGrid As DataTable)
        Dim errMsg As String = ""
        Try
            '>>> Setup Aspose
            'license setup goes here normally

            '>>> Open document
            Dim pdfFilepath As String = "" 'path to the PDF goes here
            Dim pdfDocument As New Aspose.Pdf.Document(pdfFilepath)
            Dim pdfTotalPages As Integer = pdfDocument.Pages.Count
            Dim pageNumber As Integer

            '>>> Find word information for each page
            Dim extractedText As String
            Dim extractedTextRow As DataRow
            For pageNumber = 1 To pdfTotalPages
                Dim textAbsorberPage As New TextAbsorber()
                Dim textFragmentAbsorber As New TextFragmentAbsorber("\b\S+")

                'True enables regular expression search
                Dim textSearchOptions As New TextSearchOptions(True)
                textFragmentAbsorber.TextSearchOptions = textSearchOptions

                'full page text, extracted separately from the fragments
                pdfDocument.Pages(pageNumber).Accept(textAbsorberPage)
                extractedText = textAbsorberPage.Text

                'one fragment per regex match
                pdfDocument.Pages(pageNumber).Accept(textFragmentAbsorber)
                Dim textFragmentCollection As TextFragmentCollection = textFragmentAbsorber.TextFragments

                'loop through the fragments and store each one in the "Word" column
                For Each textFragment As TextFragment In textFragmentCollection
                    extractedTextRow = extractedTextGrid.NewRow
                    extractedTextRow.Item("Word") = textFragment.Text
                    extractedTextGrid.Rows.Add(extractedTextRow)
                Next textFragment

                textFragmentAbsorber = Nothing
                textAbsorberPage = Nothing
                GC.Collect()
            Next
        Catch ex As Exception
            'errMsg is only collected here; the full code handles it elsewhere
            errMsg = errMsg & " " & ex.Message.ToString
        End Try
    End Sub
End Module
@instaknow
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in the Free Support Policies.
Issue ID(s): PDFNET-55635
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.