How to get hidden text from DOCX using Java

Aspose Team,

Using Aspose.Words for Java 19.11, we’re trying to identify hidden text in various DOCX documents. However, when we test the included sample document, Aspose.Words fails to identify the hidden portions of the document; see the sscott5@enron.com email address in this document specifically. When we view the document in Microsoft Word and reveal the hidden text, it’s apparent that some of the document text is in fact hidden. The Aspose document “runs” don’t reflect this.

Here’s a sample bit of code meant to show the relevant pieces:

Document doc = new Document(path);
StringBuilder sb = new StringBuilder();
// Visit every run in the document, including runs nested in child nodes.
for (Object node : doc.getChildNodes(NodeType.RUN, true)) {
    Run run = (Run) node;
    if (run.getFont().getHidden()) {
        // addEntry is our own helper that appends the text to the buffer.
        addEntry(sb, run.toString(SaveFormat.TEXT));
    }
}
if (sb.length() > 0) {
    Logger.Log("Doc Hidden Text=" + sb.toString());
}

I’ve attached the DOCX file that exhibits the problem that we’re seeing. Please advise.

Thanks for your help.
Jerry
HiddenTextSampleForAspose.zip (11.4 KB)

@jmuth

We tested this scenario with the latest version, Aspose.Words for Java 20.5, using the following simple code example and could not reproduce the issue. Please upgrade to Aspose.Words for Java 20.5.

Document doc = new Document(MyDir + "Test for hidden content track changes.docx");
for (Run run : (Iterable<Run>) doc.getChildNodes(NodeType.RUN, true))
{
    if (run.getFont().getHidden())
    {
        System.out.println(run.getText());
    }
}

Tahir,

Thanks for your quick reply. We will upgrade our version of Aspose Words and see if this resolves this issue.

Thanks,
Jerry

@jmuth

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Hi, Tahir.

As you suggested we tried out the Aspose Words 20.5 upgrade from 19.11. Unfortunately we still see the same bug as before - not all of the document’s hidden content is being identified. In the first example file I provided, the sscott5@enron.com email address was hidden and not identified. In the second example file I’m attaching, the hidden text word “line” is not identified. However, if you enable viewing of Hidden Text in Microsoft Word both of these text items are identified as hidden text. Can you please investigate this issue again? Thank you.
HiddenTextSampleForAspose_Example2.zip (12.4 KB)

@jmuth

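Your document contains a tracked formatting revision that sets the text to ‘Hidden’. Until that revision is accepted, the runs still report their original, non-hidden font formatting. You can use the following code example, which accepts all revisions on a clone of the document before checking the runs, to get the desired output.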

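Document doc = new Document(MyDir + "Test for hidden content track changes_v2.docx");
// Work on a clone so the revisions in the original document stay untouched.
Document docClone = (Document) doc.deepClone(true);
// Accepting the revisions applies the tracked 'Hidden' formatting to the runs.
docClone.acceptAllRevisions();
for (Run run : (Iterable<Run>) docClone.getChildNodes(NodeType.RUN, true))
{
    if (run.getFont().getHidden())
    {
        System.out.println(run.getText());
    }
}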
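
If you would like to confirm this independently of Aspose.Words, the sketch below reads word/document.xml straight from the DOCX package with the standard Java library and lists runs whose own properties contain w:vanish, noting whether a tracked formatting change (w:rPrChange) sits beside it. The class and method names (HiddenRunScanner, scan, scanDocx) are hypothetical, and this is only a rough diagnostic: it assumes the hidden flag is set directly on the run and does not resolve hidden formatting inherited from styles, nor text in headers, footers, or other package parts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper; not part of Aspose.Words.
final class HiddenRunScanner {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the first direct child element w:<localName>, or null if there is none.
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // Lists the text of runs whose current run properties switch w:vanish on.
    static List<String> scan(InputStream documentXml) {
        org.w3c.dom.Document dom;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            dom = factory.newDocumentBuilder().parse(documentXml);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document.xml", e);
        }
        List<String> hidden = new ArrayList<>();
        NodeList runs = dom.getElementsByTagNameNS(W, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            Element rPr = child(run, "rPr");
            // Only direct children count: w:vanish inside w:rPrChange is the old formatting.
            Element vanish = (rPr == null) ? null : child(rPr, "vanish");
            if (vanish == null) {
                continue;
            }
            String val = vanish.getAttributeNS(W, "val");
            if (val.equals("0") || val.equals("false") || val.equals("off")) {
                continue; // w:vanish is present but explicitly switched off
            }
            boolean tracked = child(rPr, "rPrChange") != null;
            StringBuilder text = new StringBuilder();
            NodeList parts = run.getElementsByTagNameNS(W, "t");
            for (int j = 0; j < parts.getLength(); j++) {
                text.append(parts.item(j).getTextContent());
            }
            hidden.add((tracked ? "[tracked format change] " : "[direct] ") + text);
        }
        return hidden;
    }

    // Convenience wrapper: opens the DOCX (a ZIP package) and scans its main document part.
    static List<String> scanDocx(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return scan(in);
            }
        }
    }
}
```

Calling HiddenRunScanner.scanDocx with the path to one of the sample files should show whether the runs in question carry the "[tracked format change]" prefix. If they do, that is consistent with needing acceptAllRevisions() before getHidden() reports them.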

Tahir,

Thank you for the updated code sample that shows how to find all of the hidden text. I can now confirm with the upgrade to Aspose Words 20.5 and the use of updated code as in your sample, we can now extract all of the hidden text for these documents. I appreciate your help in resolving this matter.

Thanks again.
Jerry

@jmuth

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.