Hello,
We are using Aspose.PDF for .NET to pull words out of PDF files, along with basic related information (coordinates, font size, font family, text color). We do this using TextFragmentAbsorber, TextFragmentCollection, and TextFragment.
We have started to encounter files that have “invisible” text in them – text that is not visible when reading the file on screen, but can be found if all the text on the page is highlighted and copied out. Aspose seems to see this additional “invisible” text the same as all surrounding text. Sometimes the text color is reported differently than the rest, but often it is not.
We need to differentiate this “invisible” text from the rest. I have attached an example to show what I mean.
Invisible Text.pdf (39.1 KB)
In this bank statement file, there are four invisible words in the bottom left-hand corner: “Powered by TCPDF (www.tcpdf.org)”. According to the data the text fragments provide for each word, the text color is “#00”, the same as all other words on this page.
Are there any methods that we have overlooked for differentiating the “invisible” words from all of the visible ones?
Thank you!
@instaknow
Would you please share the sample code snippet that you are using to extract the words from the PDF? We will test the scenario in our environment and address it accordingly.
Thank you, here is a modified code snippet – I cannot post the full code here, but I believe this should be enough to display the issue (I apologize if there are minor issues, I had to pull it out of a much larger otherwise working piece of code):
Imports System.Data
Imports Aspose
Imports Aspose.Pdf
Imports Aspose.Pdf.Page
Imports Aspose.Pdf.Text
Imports Aspose.Pdf.Text.TextOptions
Imports Aspose.Pdf.Document

Module Test_Code_for_Aspose
    Private Sub Get_Text_Details_Using_Aspose(ByRef extractedTextGrid As DataTable)
        Dim errMsg As String = ""
        Try
            '>>> Setup Aspose
            'license setup goes here normally

            '>>> Open document
            Dim pdfFilepath As String = ""
            Dim pdfDocument As New Aspose.Pdf.Document(pdfFilepath)
            Dim pdfTotalPages As Integer = pdfDocument.Pages.Count
            Dim pageNumber As Integer

            '>>> Find word information for each page
            Dim extractedText As String
            Dim extractedTextRow As DataRow
            For pageNumber = 1 To pdfTotalPages
                Dim textAbsorberPage As New TextAbsorber()
                'regular expression: each run of non-whitespace characters becomes one fragment
                Dim textFragmentAbsorber As New TextFragmentAbsorber("\b\S+")
                'True = treat the search phrase as a regular expression
                Dim textSearchOptions As New TextSearchOptions(True)
                textFragmentAbsorber.TextSearchOptions = textSearchOptions

                'full page text
                pdfDocument.Pages(pageNumber).Accept(textAbsorberPage)
                extractedText = textAbsorberPage.Text

                'individual words
                pdfDocument.Pages(pageNumber).Accept(textFragmentAbsorber)
                Dim textFragmentCollection As TextFragmentCollection = textFragmentAbsorber.TextFragments

                'loop through the fragments
                For Each textFragment As TextFragment In textFragmentCollection
                    extractedTextRow = extractedTextGrid.NewRow
                    extractedTextRow("Word") = textFragment.Text
                    extractedTextRow("FontName") = textFragment.TextState.Font.FontName
                    'font size is read from TextState rather than from Font
                    extractedTextRow("FontSize") = textFragment.TextState.FontSize
                    extractedTextRow("Rotation") = textFragment.TextState.Rotation
                    extractedTextGrid.Rows.Add(extractedTextRow)
                Next textFragment

                textFragmentAbsorber = Nothing
                textAbsorberPage = Nothing
                GC.Collect()
            Next
        Catch ex As Exception
            errMsg = errMsg & " " & ex.Message
        End Try
    End Sub
End Module
@instaknow
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-55908
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
Hello,
Thank you for creating a ticket, but this was less a bug report and more a question about how best to accomplish something. Is there an existing method that could be used to find text that is not visible in the PDF?
Thank you!
@instaknow
Yes, you are right. We did not log this ticket as a bug report. It is an investigation ticket, under which we will analyze the internal components of the API to determine whether this requirement can be achieved with the existing API model. We will share our feedback with you as soon as the ticket is resolved.
Thank you very much for correcting me, I misunderstood!
Is there any update on this issue? Thank you very much for your time.
@instaknow
The ticket is still in the investigation phase. In the meantime, could you please try the code below to detect and remove the hidden text:
var document = new Document(inputFile);
var textAbsorber = new TextFragmentAbsorber();
// This option can be used to prevent other text fragments from moving after hidden text replacement.
textAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);
document.Pages.Accept(textAbsorber);
foreach (var fragment in textAbsorber.TextFragments)
{
    if (fragment.TextState.Invisible)
    {
        fragment.Text = "";
    }
}
document.Save(outputFile);
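If the goal is to tell invisible words apart rather than delete them, the same flag could be recorded for each word instead. The following is only a minimal sketch, assuming the Aspose.PDF for .NET package is referenced and that TextState.Invisible is the only signal available. The HiddenTextReport class and ListFragments method are hypothetical names used for illustration, and the word pattern is the one from the VB snippet earlier in this thread. Text that is hidden by some other means may not be flagged by this check.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper for illustration; not part of the Aspose API.
public static class HiddenTextReport
{
    public static void ListFragments(string inputFile)
    {
        var document = new Document(inputFile);

        // One fragment per run of non-whitespace characters,
        // using the same regular expression as the VB snippet above.
        var absorber = new TextFragmentAbsorber(@"\b\S+");
        absorber.TextSearchOptions = new TextSearchOptions(true);
        document.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Tag each word instead of blanking it out.
            string tag = fragment.TextState.Invisible ? "INVISIBLE" : "visible";
            Console.WriteLine($"{tag}\t{fragment.Text}");
        }
    }
}
```

Calling HiddenTextReport.ListFragments with the path to a PDF prints one line per word, so the flagged words can be compared against what is actually visible on screen.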
Thank you for the update. We tested the code and found that it works for only a very small number of files. For most files, TextState.Invisible returns False for the “invisible” text.
@instaknow
Would it be possible for you to share files that contain invisible text which the API is still unable to detect? We will update the logged ticket accordingly.
I have attached below two files:
- the redacted PDF which contains invisible text that is not detectable.
- a screenshot highlighting the invisible text (mostly “$” characters)
I’m happy to provide any further assistance that helps resolve the issue.
Thank you for your ongoing help.
Invisible Text (textstate.invisible does not work).pdf (129.1 KB)
Invisible Text (textstate.invisible does not work) - highlighted issues.png (65.0 KB)
@instaknow
Thanks for sharing the requested information. We have updated the ticket and will let you know as soon as it is resolved. Please allow us some time.
We are sorry for the inconvenience.
Hello,
Is there any news on this item? Thank you very much for your time.
@instaknow
We are afraid that the issue has not yet been fully investigated. The ticket will be handled on a first-come, first-served basis, and we will share any news about its resolution with you as soon as we have it. Please allow us some time.