Text content

andyradchenko · August 7, 2013, 4:18am

I am extracting text content from a DOCX document.
Is there a way to distinguish a Run that contains real text shown on the page from a Run that contains field information (for example, HYPERLINK)?
Basically, I need to ignore the Runs that have field-related text inside.
But I can’t find a civilized way to do that.

I could create a set with the names of all possible fields and filter out every Run whose text starts with one of those names. But what if a Run starts with HYPERLINK, for example, and is not related to a field? I would ignore it as well, which is not the desired behavior.
Field-related text is inside the w:instrText node, whereas the usual content is in w:t, but I can’t find anything in the Run properties that would indicate that the Run is field-related.
Could you advise how to resolve this problem?

awais.hafeez · August 12, 2013, 4:21am

Hi Andrey,

Thanks for your inquiry. Please download the latest version of Aspose.Words for Java from here (13.7.0) and then use the following code snippet to obtain the text representation of Paragraphs:

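// doc is an already loaded com.aspose.words.Document
NodeCollection nodeColl = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph para : (Iterable<Paragraph>) nodeColl)
{
    // Convert the paragraph to plain text using the TEXT save format
    String text = para.toString(SaveFormat.TEXT);
    System.out.println(text);
}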
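
If working directly at the file-format level is acceptable, the w:t versus w:instrText distinction mentioned in the question can also be used as-is. The following is only a minimal sketch that relies on the standard Java library, not on Aspose.Words; the class name DocxVisibleText and its methods are made up for illustration. It assumes the main text lives in word/document.xml and simply concatenates every w:t element, so it adds no paragraph separators and ignores headers, footers and footnotes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of any library.
class DocxVisibleText {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the content of every w:t element in document order.
    // Field codes are stored in w:instrText, so they are never matched here.
    static String fromXml(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder().parse(xml);
        NodeList textNodes = dom.getElementsByTagNameNS(W_NS, "t");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < textNodes.getLength(); i++) {
            sb.append(textNodes.item(i).getTextContent());
        }
        return sb.toString();
    }

    // Reads word/document.xml out of the DOCX (ZIP) package.
    static String fromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            try (InputStream in = zip.getInputStream(entry)) {
                return fromXml(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            System.out.println(fromDocx(args[0]));
            return;
        }
        // Hand-written sample: a HYPERLINK field whose visible result
        // also happens to start with the word HYPERLINK.
        String sample =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:t xml:space=\"preserve\">See </w:t></w:r>"
            + "<w:r><w:fldChar w:fldCharType=\"begin\"/></w:r>"
            + "<w:r><w:instrText> HYPERLINK \"http://example.com\" </w:instrText></w:r>"
            + "<w:r><w:fldChar w:fldCharType=\"separate\"/></w:r>"
            + "<w:r><w:t>HYPERLINK docs</w:t></w:r>"
            + "<w:r><w:fldChar w:fldCharType=\"end\"/></w:r>"
            + "</w:p></w:body></w:document>";
        InputStream in =
            new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(fromXml(in)); // prints "See HYPERLINK docs"
    }
}
```

Because the filter is based on the element name rather than on the text itself, a Run whose visible text happens to start with HYPERLINK is kept, while the field code is dropped.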

I hope this helps.

Best regards,