Aspose.PDF Losing All Left Parentheses

instaknow · March 4, 2020, 3:49am

Hello,

I am using Aspose.PDF for .NET to extract the individual words from a PDF. Whenever I pull a word that starts with a left parenthesis – “(” – that character is always missing from the extracted text. The right parenthesis is always present. The left parenthesis is present in the OCR’d text when I manually copy and paste it from the document into Notepad, so I know that it should be available for extraction by Aspose.

Any help you could offer on this would be appreciated, as I need to extract all the words from the document as they are on the page.

Thank you.

asad.ali · March 4, 2020, 1:16pm

@instaknow

Would you kindly share a sample PDF along with a complete sample code snippet? We will test the scenario in our environment and address it accordingly.

instaknow · March 4, 2020, 6:11pm

Here is a test PDF that exhibits the same issue as the real data PDFs: test capital x.pdf (5.4 KB)

On the page, we are actually seeing multiple characters drop (all left parentheses as well as all dollar signs). For instance, in this piece of text:

Investments, at fair value (cost basis $123,000,000)

we are seeing this result:

Investments, at fair value cost basis 123,000,000)

It is very important that we not lose these characters as they are integral to the meaning we are trying to pull from the file.

Here is a modified code snippet – I cannot post the full code here, but I believe this should be enough to reproduce the issue (I apologize if there are minor errors; I had to pull it out of a much larger, otherwise working piece of code):

Imports System.Data ' needed for DataTable and DataRow
Imports Aspose
Imports Aspose.Pdf
Imports Aspose.Pdf.Page
Imports Aspose.Pdf.Text
Imports Aspose.Pdf.Text.TextOptions
Imports Aspose.Pdf.Document

Module Test_Code_for_Aspose

Private Sub Get_Text_Details_Using_Aspose(ByRef extractedTextGrid As DataTable)

    Dim errMsg As String = ""

    Try
        '>>> Setup Aspose
        'license setup goes here normally

        '>>> Open document
        Dim pdfFilepath As String = ""
        Dim pdfDocument As New Aspose.Pdf.Document(pdfFilepath)
        Dim pdfTotalPages As Integer = pdfDocument.Pages.Count
        Dim pageNumber As Integer

        '>>> Find word information for each page
        Dim extractedText As String
        Dim extractedTextRow As DataRow
        For pageNumber = 1 To pdfTotalPages
            Dim textAbsorberPage As New TextAbsorber()
            Dim textFragmentAbsorber As New TextFragmentAbsorber("\b\S+")
            Dim textSearchOptions As New TextSearchOptions(True)
            textFragmentAbsorber.TextSearchOptions = textSearchOptions
            pdfDocument.Pages(pageNumber).Accept(textAbsorberPage)
            extractedText = textAbsorberPage.Text
            pdfDocument.Pages(pageNumber).Accept(textFragmentAbsorber)
            Dim textFragmentCollection As TextFragmentCollection = textFragmentAbsorber.TextFragments

            'loop through the fragments
            For Each textFragment As TextFragment In textFragmentCollection
                extractedTextRow = extractedTextGrid.NewRow
                extractedTextGrid.Rows.Add(extractedTextRow)
                extractedTextGrid.Rows(extractedTextGrid.Rows.Count - 1).Item("Word") = textFragment.Text
            Next textFragment

            textFragmentAbsorber = Nothing
            textAbsorberPage = Nothing
            GC.Collect()
        Next

    Catch ex As Exception
        errMsg = errMsg & " " & ex.Message.ToString
    End Try
End Sub

End Module

instaknow · March 4, 2020, 8:23pm

Further testing on our end has given us the following results:

If a left parenthesis or dollar sign is at the start of a word, it disappears from the extracted text.
If one of these characters is placed in the middle of a word, it is found normally.
If one of these characters is placed by itself (surrounded by spaces), it disappears from the extracted text.

We have just started testing other characters in the same way and see similar behavior when a right parenthesis is at the start of a word or stands by itself (surrounded by spaces).

Edited to add: we have tested a wide variety of Unicode characters and found that the behavior occurs for almost all non-alphanumeric characters that stand alone or appear at the beginning of a word. Here is a screenshot comparison from a small part of a larger character test – the results were very much the same across the board: screenshot of unicode character test.png (85.8 KB)

asad.ali · March 4, 2020, 9:29pm

@instaknow

We have reviewed the information you shared, and the issue appears to come from the regular expression used in your code snippet rather than from the text extraction itself. Please try using \S+ instead of \b\S+ in your code snippet and share your feedback with us.
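
For illustration, the difference between the two patterns can be reproduced outside Aspose. In standard regex semantics, \b matches only at a position between a word character (letter, digit, underscore) and a non-word character, so a match for \b\S+ cannot begin at a leading "(" or "$" that follows a space. The sketch below uses Python's re module, not Aspose.PDF or .NET, on the assumption that the .NET regex engine treats \b the same way for these ASCII characters. The helper name extract_words is hypothetical.

```python
import re

# Sample sentence taken from the post above.
SAMPLE = "Investments, at fair value (cost basis $123,000,000)"

# One token for each reported case: at the start of a word,
# in the middle of a word, and standing alone between spaces.
CASES = "(start mid(dle ( alone"


def extract_words(pattern, text):
    """Return every non-overlapping match of `pattern` in `text`."""
    return re.findall(pattern, text)


# \b requires a word character on exactly one side of the position.
# Between a space and "(" (or "$") both sides are non-word characters,
# so the match can only begin at the following letter or digit.
with_boundary = extract_words(r"\b\S+", SAMPLE)

# Without \b, a match begins at the first non-whitespace character.
without_boundary = extract_words(r"\S+", SAMPLE)

print(" ".join(with_boundary))
# Investments, at fair value cost basis 123,000,000)
print(" ".join(without_boundary))
# Investments, at fair value (cost basis $123,000,000)

print(extract_words(r"\b\S+", CASES))
# ['start', 'mid(dle', 'alone']
print(extract_words(r"\S+", CASES))
# ['(start', 'mid(dle', '(', 'alone']
```

The first printed line matches the truncated result reported earlier in the thread, and the CASES output is consistent with the three observations: leading and stand-alone symbols are lost under \b\S+, while a symbol in the middle of a word is kept.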

instaknow · March 4, 2020, 9:49pm

Thank you very much for your reply – you are correct and that appears to have resolved our issue. We recently updated our OCR software – the old regex worked perfectly in Aspose with the previous version and then stopped working with the update, so it was hard for us to see that change.

Thank you so much for your assistance, we appreciate it.

asad.ali · March 5, 2020, 12:07am

@instaknow

Thanks for your kind feedback.

Please keep using our API and in case you need further assistance, please feel free to let us know.