Can you please provide sample code for this feature or point to the item in the documentation, as I can’t find any new functions for this. Is this different to the HiddenText detection added in 21.9 (What's new | Aspose.PDF for .NET)?
The feature to detect invisible text was added in version 21.10 of the API, and its code example is already given at the link that you shared. We are not sure what you are asking here.
Please do not move my comment from the item I posted it on. It was specific to that item. Of course it does not make sense now, because you’ve separated it from the item I am asking about: PDFNET-38031, which is in 23.11.
So what is PDFNET-38031, how do I use it, and how does it differ from the feature described in the What's new for 21.9?
PDFNET-38031 is marked as a feature of 23.11 (Remove hidden text from PDF file), but there is no mention of how to use it or of what is new.
After this feature was added in the older version, it did not work perfectly for some specific PDF documents. Therefore, the same functionality has been enhanced and the enhancement incorporated in version 23.11. Below is the simple code snippet that was provided with the fix for the reported issue:
The TextFragment.TextState.Invisible property now correctly reports hidden text.
To remove the hidden text, the following code snippet can be used:
using Aspose.Pdf;
using Aspose.Pdf.Text;

// inputFile and outputFile are the paths of the source and result PDF files.
var document = new Document(inputFile);
var textAbsorber = new TextFragmentAbsorber();
// This option can be used to prevent other text fragments from moving after hidden text replacement.
textAbsorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);
document.Pages.Accept(textAbsorber);
foreach (var fragment in textAbsorber.TextFragments)
{
    // Blank out every fragment that the API reports as invisible.
    if (fragment.TextState.Invisible)
    {
        fragment.Text = "";
    }
}
document.Save(outputFile);
So to be clear, there is no change in what Aspose.PDF is doing. From my previous tests, it detects text that has no “fill” colour. Is that correct? Are there any other criteria?
Does PDFNET-38031 address the issue the original poster had, i.e. “When I examined a PDF document in ADOBE then it shows lot of hidden text specially outside the page boundary”, or was this already handled previously? I don’t have Acrobat Pro, so I don’t know how to test this. I would like to know whether the sample code you provided also detects text outside the page boundary.
Yes, this fix addresses the issue that the original poster reported. Additionally, the algorithm for detecting invisible text in a PDF document has been updated. For all the tested documents, Adobe Acrobat doesn’t find any hidden text after the code snippet is applied. You can try it with your PDF that has invisible text outside the page boundaries and let us know if the API fails to perform as expected.
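For anyone without Acrobat Pro, one rough way to check the off-page case is to compare each fragment's rectangle with the page rectangle and print the Invisible flag next to it. This is only a sketch under assumptions: the file name is a placeholder, it treats Page.Rect as the relevant boundary, and it ignores page rotation and clipping paths, so it may miss or misreport some fragments.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class OffPageTextCheck
{
    // True when the fragment rectangle does not overlap the page rectangle at all.
    static bool IsOutsidePage(Aspose.Pdf.Rectangle fragment, Aspose.Pdf.Rectangle page)
    {
        return fragment.URX < page.LLX || fragment.LLX > page.URX
            || fragment.URY < page.LLY || fragment.LLY > page.URY;
    }

    static void Main()
    {
        var document = new Document("input.pdf"); // placeholder path
        var absorber = new TextFragmentAbsorber();
        document.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Report only fragments that lie entirely off the page,
            // together with what the API says about their visibility.
            if (IsOutsidePage(fragment.Rectangle, fragment.Page.Rect))
            {
                Console.WriteLine(
                    $"Page {fragment.Page.Number}: \"{fragment.Text}\" is off-page, " +
                    $"Invisible={fragment.TextState.Invisible}");
            }
        }
    }
}
```

If every off-page fragment is printed with Invisible=True, the detection covers that case for the document being tested; any line with Invisible=False would be worth reporting with a sample file.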
I have tested this over the last month and found some issues with the TextState.Invisible property being incorrect. I have used the following code based on your example to find hidden text:
Aspose.Pdf.Text.TextFragmentAbsorber textAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
textAbsorber.TextReplaceOptions = new Aspose.Pdf.Text.TextReplaceOptions(Aspose.Pdf.Text.TextReplaceOptions.ReplaceAdjustment.None);
document.Pages.Accept(textAbsorber);
foreach (Aspose.Pdf.Text.TextFragment fragment in textAbsorber.TextFragments)
{
    if (fragment.TextState.Invisible)
        HiddenText.Add($"Page {fragment.Page.Number}: {fragment.Text}");
}
However, the Invisible property sometimes returns true for visible text. I have attached a sample that displays the issue (this is not an isolated case, but I cannot send other documents due to confidentiality).
I notice that in these cases the “RenderingMode” is set to FillText, not Invisible, whereas in other cases the RenderingMode is Invisible when the item is genuinely not meant to be shown.
At the moment I have no confidence that the code provided will not remove text that is visible. I’m not sure why there’s a discrepancy, can you check why it’s reporting the text as invisible?
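Until the discrepancy is explained, a more cautious variant of the removal loop could blank a fragment only when the Invisible flag and the rendering mode agree, and merely list the remaining flagged fragments for manual review. A minimal sketch, assuming TextState.RenderingMode is typed as the Aspose.Pdf.Text.TextRenderingMode enum; the file names are placeholders, and this deliberately leaves in place any text that is hidden for other reasons (for example, text placed outside the page).

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class CautiousHiddenTextRemoval
{
    static void Main()
    {
        var document = new Document("input.pdf"); // placeholder path
        var absorber = new TextFragmentAbsorber();
        // Keep the surrounding text in place when a fragment is emptied.
        absorber.TextReplaceOptions =
            new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);
        document.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            if (!fragment.TextState.Invisible)
                continue;

            if (fragment.TextState.RenderingMode == TextRenderingMode.Invisible)
            {
                // Both indicators agree, so removing the text is lower risk.
                fragment.Text = "";
            }
            else
            {
                // The flag and the rendering mode disagree: report, do not remove.
                Console.WriteLine(
                    $"Review page {fragment.Page.Number}: \"{fragment.Text}\" " +
                    $"(RenderingMode={fragment.TextState.RenderingMode})");
            }
        }

        document.Save("output.pdf"); // placeholder path
    }
}
```

The trade-off is that this removes less than the original snippet does, in exchange for never deleting text whose rendering mode says it is drawn normally.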
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-56284
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.