Hello,
I am using Aspose.PDF for .NET to extract the individual words from a PDF. Whenever I pull a word that starts with a left parenthesis – “(” – that character is always missing from the extracted text. The right parenthesis is always present. The left parenthesis is present in the OCR’d text when I manually copy and paste it from the document into Notepad, so I know that it should be available for extraction by Aspose.
Any help you could offer on this would be appreciated, as I need to extract all the words from the document as they are on the page.
Thank you.
@instaknow
Would you kindly share a sample PDF along with a complete sample code snippet? We will test the scenario in our environment and address it accordingly.
Here is a test PDF that exhibits the same issue as the real data PDFs: test capital x.pdf (5.4 KB)
On the page, we are actually seeing multiple characters drop (all left parentheses as well as all dollar signs). For instance in this piece of text:
Investments, at fair value (cost basis $123,000,000)
we are seeing this result:
Investments, at fair value cost basis 123,000,000)
It is very important that we not lose these characters as they are integral to the meaning we are trying to pull from the file.
Here is a modified code snippet – I cannot post the full code here, but I believe this should be enough to display the issue (I apologize if there are minor issues; I had to pull it out of a much larger, otherwise working piece of code):
Imports System.Data
Imports Aspose
Imports Aspose.Pdf
Imports Aspose.Pdf.Page
Imports Aspose.Pdf.Text
Imports Aspose.Pdf.Text.TextOptions
Imports Aspose.Pdf.Document

Module Test_Code_for_Aspose
    Private Sub Get_Text_Details_Using_Aspose(ByRef extractedTextGrid As DataTable)
        Dim errMsg As String = ""
        Try
            '>>> Setup Aspose
            'license setup goes here normally

            '>>> Open document
            Dim pdfFilepath As String = "" 'path to the PDF goes here
            Dim pdfDocument As New Aspose.Pdf.Document(pdfFilepath)
            Dim pdfTotalPages As Integer = pdfDocument.Pages.Count
            Dim pageNumber As Integer

            '>>> Find word information for each page
            Dim extractedText As String
            Dim extractedTextRow As DataRow
            For pageNumber = 1 To pdfTotalPages
                Dim textAbsorberPage As New TextAbsorber()
                'regular expression intended to match each word on the page
                Dim textFragmentAbsorber As New TextFragmentAbsorber("\b\S+")
                'True = treat the search phrase as a regular expression
                Dim textSearchOptions As New TextSearchOptions(True)
                textFragmentAbsorber.TextSearchOptions = textSearchOptions

                'full page text
                pdfDocument.Pages(pageNumber).Accept(textAbsorberPage)
                extractedText = textAbsorberPage.Text

                'individual words
                pdfDocument.Pages(pageNumber).Accept(textFragmentAbsorber)
                Dim textFragmentCollection As TextFragmentCollection = textFragmentAbsorber.TextFragments

                'loop through the fragments
                For Each textFragment As TextFragment In textFragmentCollection
                    extractedTextRow = extractedTextGrid.NewRow
                    extractedTextGrid.Rows.Add(extractedTextRow)
                    extractedTextGrid.Rows(extractedTextGrid.Rows.Count - 1).Item("Word") = textFragment.Text
                Next textFragment

                textFragmentAbsorber = Nothing
                textAbsorberPage = Nothing
                GC.Collect()
            Next
        Catch ex As Exception
            errMsg = errMsg & " " & ex.Message.ToString
        End Try
    End Sub
End Module
Further testing on our end has given us the following results:
- If a left parenthesis or dollar sign is at the start of a word, it disappears from the extracted text.
- If one of these characters is placed in the middle of a word, it is found normally.
- If one of these characters is placed by itself (surrounded by spaces), it disappears from the extracted text.
We have just started to test with other characters in this manner as well, and we are seeing similar issues when a right parenthesis is at the start of a word or is found by itself (surrounded by spaces).
Edited to add: we have tested with a wide variety of Unicode characters and found that the behavior occurs for almost all non-alphanumeric characters that stand alone or appear at the beginning of a word. Here is a screenshot comparison of a small part of a larger character test – the results were very much the same across the board: screenshot of unicode character test.png (85.8 KB)
@instaknow
We have checked the information you shared, and it appears that the regular expression in your code snippet needs to be adjusted to return the required results. Please try using \S+ instead of \b\S+ in your code snippet and share your feedback with us.
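For anyone wondering why the leading \b matters: \b only matches at a boundary between a word character (letter, digit, underscore) and a non-word character, so a match cannot begin on "(" or "$" when it follows a space or starts the text. The short sketch below reproduces the difference on the sample line from this thread. It uses Python's standard re module purely for illustration, on the assumption that .NET treats \b the same way for these ASCII characters; it does not call Aspose.PDF or show how Aspose applies the pattern internally.

```python
import re

# Sample line quoted earlier in this thread.
text = "Investments, at fair value (cost basis $123,000,000)"

# Original pattern: \b forces every match to begin at a word boundary,
# so leading "(" and "$" are skipped and the match starts at the next
# letter or digit.
with_boundary = re.findall(r"\b\S+", text)

# Suggested pattern: any run of non-whitespace characters.
without_boundary = re.findall(r"\S+", text)

print(" ".join(with_boundary))
print(" ".join(without_boundary))

# The three cases reported above: start of word, middle of word, stand-alone.
print(re.findall(r"\b\S+", "(start mid(dle ( alone"))
```

With \b\S+ the first print gives the same damaged line reported above, while \S+ returns every token intact. The last print shows that a symbol in the middle of a word survives, because the match has already started on a letter by the time the symbol is reached.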
Thank you very much for your reply – you are correct and that appears to have resolved our issue. We recently updated our OCR software – the old regex worked perfectly in Aspose with the previous version and then stopped working with the update, so it was hard for us to see that change.
Thank you so much for your assistance, we appreciate it.
@instaknow
Thanks for your kind feedback.
Please keep using our API and in case you need further assistance, please feel free to let us know.