How to Detect "Invisible" Characters

instaknow · November 13, 2023, 8:28pm

Hello,

We are using Aspose.PDF for .NET to pull words out of PDF files, along with basic related information (coordinates, font size, font family, text color). We do this using TextFragmentAbsorber, TextFragmentCollection, and TextFragment.

We have started to encounter files that have “invisible” text in them – text that is not visible when reading the file on screen, but can be found if all the text on the page is highlighted and copied out. Aspose seems to see this additional “invisible” text the same as all surrounding text. Sometimes the text color is reported differently than the rest, but often it is not.

We need to differentiate this “invisible” text from the rest. I have attached an example to show what I mean.

Invisible Text.pdf (39.1 KB)

In this bank statement file, there are four invisible words in the bottom left-hand corner: “Powered by TCPDF (www.tcpdf.org)”. According to the data returned by the TextFragment for each word, the text color is “#00”, the same as all other words on this page.

Are there any methods that we have overlooked for differentiating the “invisible” words from all of the visible ones?

Thank you!

asad.ali · November 13, 2023, 9:48pm

@instaknow

Could you please share the code snippet you are using to extract the words from the PDF? We will test the scenario in our environment and address it accordingly.

instaknow · November 14, 2023, 3:14pm

Thank you. Here is a modified code snippet – I cannot post the full code here, but I believe this should be enough to show the issue. (I apologize if there are minor errors; I had to pull it out of a much larger, otherwise working piece of code.)

Imports System.Data 'DataTable and DataRow
Imports Aspose
Imports Aspose.Pdf
Imports Aspose.Pdf.Page
Imports Aspose.Pdf.Text
Imports Aspose.Pdf.Text.TextOptions
Imports Aspose.Pdf.Document


Module Test_Code_for_Aspose

    Private Sub Get_Text_Details_Using_Aspose(ByRef extractedTextGrid As DataTable)

        Dim errMsg As String = ""

        Try
            '>>> Setup Aspose
            'license setup goes here normally

            '>>> Open document
            Dim pdfFilepath As String = "" 'set this to the path of the input PDF (left blank in this excerpt)
            Dim pdfDocument As New Aspose.Pdf.Document(pdfFilepath)
            Dim pdfTotalPages As Integer = pdfDocument.Pages.Count
            Dim pageNumber As Integer

            '>>> Find word information for each page
            Dim extractedText As String
            Dim extractedTextRow As DataRow
            For pageNumber = 1 To pdfTotalPages
                Dim textAbsorberPage As New TextAbsorber()
                Dim textFragmentAbsorber As New TextFragmentAbsorber("\b\S+")
                Dim textSearchOptions As New TextSearchOptions(True)
                textFragmentAbsorber.TextSearchOptions = textSearchOptions
                pdfDocument.Pages(pageNumber).Accept(textAbsorberPage)
                extractedText = textAbsorberPage.Text
                pdfDocument.Pages(pageNumber).Accept(textFragmentAbsorber)
                Dim textFragmentCollection As TextFragmentCollection = textFragmentAbsorber.TextFragments

                'loop through the fragments
                For Each textFragment As TextFragment In textFragmentCollection
                    'fill the new row directly, then add it to the table
                    extractedTextRow = extractedTextGrid.NewRow()
                    extractedTextRow.Item("Word") = textFragment.Text
                    extractedTextRow.Item("FontName") = textFragment.TextState.Font.FontName
                    'FontSize is a property of TextState, not of Font
                    extractedTextRow.Item("FontSize") = textFragment.TextState.FontSize
                    extractedTextRow.Item("Rotation") = textFragment.TextState.Rotation
                    extractedTextGrid.Rows.Add(extractedTextRow)
                Next textFragment

                textFragmentAbsorber = Nothing
                textAbsorberPage = Nothing
                GC.Collect()
            Next

        Catch ex As Exception
            errMsg = errMsg & " " & ex.Message.ToString
        End Try
    End Sub


End Module

asad.ali · November 14, 2023, 7:02pm

@instaknow

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55908

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

instaknow · November 14, 2023, 7:15pm

Hello,

Thank you for creating a ticket, but this was less for a bug report and more a question about how best to accomplish something. Is there an existing method that could be used to find text that is not visible on the PDF?

Thank you!

asad.ali · November 15, 2023, 12:13pm

@instaknow

Yes, you are right. We did not log this ticket as a bug report. It is an investigation ticket, under which we will analyze the internal components of the API to determine whether this requirement can be achieved with the existing API. We will share our feedback with you as soon as the ticket is resolved.

instaknow · November 15, 2023, 2:00pm

Thank you very much for correcting me, I misunderstood!

instaknow · November 27, 2023, 7:12pm

Is there any update on this issue? Thank you very much for your time.

asad.ali · November 27, 2023, 10:49pm

@instaknow

The ticket is still in the investigation phase. In the meantime, could you please try the code below to detect and remove the hidden text:

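using Aspose.Pdf;
using Aspose.Pdf.Text;

// inputFile and outputFile are placeholders for your own file paths.
var document = new Document(inputFile);
var textAbsorber = new TextFragmentAbsorber();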

// This option can be used to prevent other text fragments from moving after hidden text replacement.
textAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

document.Pages.Accept(textAbsorber);

foreach (var fragment in textAbsorber.TextFragments)
{
    if (fragment.TextState.Invisible)
    {
        fragment.Text = "";
    }
}

document.Save(outputFile);
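
If the goal is only to detect hidden text rather than remove it, the same loop can report the flagged fragments without changing the document. The following is a minimal sketch under that assumption. The class name and the output format are illustrative, and the input path is a placeholder read from the command line.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ListInvisibleFragments
{
    static void Main(string[] args)
    {
        // Placeholder: path to the PDF to inspect, taken from the command line.
        string inputFile = args[0];

        using (var document = new Document(inputFile))
        {
            var textAbsorber = new TextFragmentAbsorber();
            document.Pages.Accept(textAbsorber);

            foreach (TextFragment fragment in textAbsorber.TextFragments)
            {
                // Same flag as in the snippet above; the document is only read, never saved.
                if (fragment.TextState.Invisible)
                {
                    Console.WriteLine(
                        $"Page {fragment.Page.Number}: \"{fragment.Text}\" " +
                        $"at ({fragment.Rectangle.LLX}, {fragment.Rectangle.LLY})");
                }
            }
        }
    }
}
```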

instaknow · November 28, 2023, 1:43pm

Thank you for the update. We tested the code and found that it works for only a very small number of files. For most files, TextState.Invisible returns False for the “invisible” text.
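
Where TextState.Invisible returns False, some hidden text might still be caught by simple geometric checks, such as text placed entirely outside the page boundary or drawn at a near-zero size. The sketch below is a heuristic under those assumptions, not a documented Aspose method for this problem: the helper name LooksHidden and the 1-point size threshold are hypothetical. It will not catch text that is clipped, covered by an image, or drawn in the same color as its background.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class HiddenTextHeuristics
{
    // Hypothetical helper: returns true when a fragment is probably not visible on screen.
    public static bool LooksHidden(TextFragment fragment)
    {
        // 1. The flag the library already exposes.
        if (fragment.TextState.Invisible)
        {
            return true;
        }

        // 2. Text too small to read. The 1-point threshold is an assumption.
        if (fragment.TextState.FontSize < 1f)
        {
            return true;
        }

        // 3. Text whose bounding box lies entirely outside the page's media box.
        Aspose.Pdf.Rectangle box = fragment.Rectangle;
        Aspose.Pdf.Rectangle page = fragment.Page.MediaBox;
        bool outsidePage = box.URX < page.LLX || box.LLX > page.URX
                        || box.URY < page.LLY || box.LLY > page.URY;

        return outsidePage;
    }
}
```

A fragment loop like the one in the earlier snippet could call HiddenTextHeuristics.LooksHidden(fragment) in place of the plain TextState.Invisible check.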

asad.ali · November 28, 2023, 6:41pm

@instaknow

Would it be possible for you to share files that contain invisible text which the API is still unable to detect? We will update the logged ticket accordingly.

instaknow · November 29, 2023, 2:22pm

I have attached two files below:

- The redacted PDF, which contains invisible text that is not detected.
- A screenshot highlighting the invisible text (mostly “$” characters).

I’m happy to provide any further assistance that helps resolve the issue.

Thank you for your ongoing help.

Invisible Text (textstate.invisible does not work).pdf (129.1 KB)

Invisible Text (textstate.invisible does not work) - highlighted issues.png (65.0 KB)

asad.ali · November 29, 2023, 8:52pm

@instaknow

Thanks for sharing the requested information. We have updated the ticket and will let you know as soon as it is resolved. Please allow us some time.

We are sorry for the inconvenience.

instaknow · December 14, 2023, 7:37pm

Hello,

Is there any news on this item? Thank you very much for your time.

asad.ali · December 14, 2023, 11:00pm

@instaknow

We are afraid that the issue has not yet been fully investigated. The ticket will be handled on a first-come, first-served basis, and we will share any news about its resolution with you as soon as we have it. Please allow us some time.