Iterate through the Document to get Paragraphs and Tables

SnehaPurohit · March 18, 2019, 6:33am

What is the recommended way to iterate through a document, including paragraphs and tables? I want to include shapes, OLE objects, equations, and tables, and to ignore the table of contents. I understand that a Paragraph contains the shapes and OLE objects.

If I iterate through all the nodes with NodeType.Any, it brings in a lot of nodes that I do not need to process:
NodeCollection nodes = doc.GetChildNodes(NodeType.Any, true);
I read in one of your articles (Iterate over the document) that using

Node node = doc;
while(node != doc.getLastSection().getBody().getLastParagraph().getLastChild())
{
    node = node.nextPreOrder(doc);
}

would be better.

Please let me know the best way to iterate through shapes, tables, paragraphs, equations, and OLE objects while excluding headers/footers and the table of contents.

Please suggest the best approach with performance in mind.

tahir.manzoor · March 18, 2019, 4:39pm

@SnehaPurohit

You can use the following code example to exclude header and footer nodes and iterate over the remaining nodes.

Document doc = new Document(MyDir + "in.docx");
for(Section section : doc.getSections())
{
    for(Node node : (Iterable<Node>)section.getBody().getChildNodes(NodeType.ANY, true))
    {
        //Your code
    }
}
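
The loop above still visits every node under the body, so any selection by type has to happen inside it. As a rough, library-free illustration of that idea, the sketch below walks a toy tree in pre-order, keeps only the wanted node kinds, and skips whole subtrees of an excluded kind. The `Node` class, the `Kind` enum, and the sample tree are hypothetical stand-ins, not the Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for a document tree; NOT the Aspose.Words API.
class NodeFilterSketch {
    enum Kind { SECTION, BODY, HEADER_FOOTER, PARAGRAPH, TABLE, SHAPE, OFFICE_MATH, RUN }

    static final class Node {
        final Kind kind;
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String label, Node... kids) {
            this.kind = kind;
            this.label = label;
            for (Node k : kids) children.add(k);
        }
    }

    // Pre-order walk: record nodes of a wanted kind and
    // skip the entire subtree of any excluded kind.
    static void collect(Node node, Set<Kind> wanted, Set<Kind> skipped, List<String> out) {
        if (skipped.contains(node.kind)) return;
        if (wanted.contains(node.kind)) out.add(node.kind + ":" + node.label);
        for (Node child : node.children) collect(child, wanted, skipped, out);
    }

    // Hypothetical sample: one section with a header and a small body.
    static Node sample() {
        return new Node(Kind.SECTION, "s1",
            new Node(Kind.HEADER_FOOTER, "header",
                new Node(Kind.PARAGRAPH, "page title")),
            new Node(Kind.BODY, "body",
                new Node(Kind.PARAGRAPH, "intro",
                    new Node(Kind.RUN, "text"),
                    new Node(Kind.SHAPE, "logo")),
                new Node(Kind.TABLE, "prices",
                    new Node(Kind.PARAGRAPH, "cell")),
                new Node(Kind.PARAGRAPH, "formula",
                    new Node(Kind.OFFICE_MATH, "eq1"))));
    }

    static List<String> run() {
        Set<Kind> wanted = EnumSet.of(Kind.PARAGRAPH, Kind.TABLE, Kind.SHAPE, Kind.OFFICE_MATH);
        Set<Kind> skipped = EnumSet.of(Kind.HEADER_FOOTER);
        List<String> out = new ArrayList<>();
        collect(sample(), wanted, skipped, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this toy model the header paragraph never reaches the result because its parent subtree is skipped, while runs are visited but not recorded. A similar type check could presumably go where the `//Your code` placeholder sits in the loop above.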

To exclude specific nodes from the document, we suggest cloning the document, removing those nodes from the clone, and iterating over the remaining nodes. The following code snippet shows how to remove shape nodes from the cloned document.

Document doc = new Document(MyDir + "in.docx");
Document docClone = (Document) doc.deepClone(true);
docClone.getChildNodes(NodeType.SHAPE, true).clear();
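
The point of working on a clone is that the original document stays untouched while the copy is pruned. The sketch below shows that pattern on a toy tree, as an illustration only: the `Node` class with its `deepCopy`, `removeAll`, and `count` methods is hypothetical and does not mirror the Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a document tree; NOT the Aspose.Words API.
class CloneAndPruneSketch {
    static final class Node {
        final String kind;
        final List<Node> children = new ArrayList<>();

        Node(String kind, Node... kids) {
            this.kind = kind;
            for (Node k : kids) children.add(k);
        }

        // Deep copy: the clone shares no child lists with the original.
        Node deepCopy() {
            Node copy = new Node(kind);
            for (Node child : children) copy.children.add(child.deepCopy());
            return copy;
        }

        // Remove every descendant of the given kind, at any depth.
        void removeAll(String unwanted) {
            children.removeIf(c -> c.kind.equals(unwanted));
            for (Node child : children) child.removeAll(unwanted);
        }

        // Count this node and its descendants of the given kind.
        int count(String wanted) {
            int n = kind.equals(wanted) ? 1 : 0;
            for (Node child : children) n += child.count(wanted);
            return n;
        }
    }

    // Hypothetical sample: a body with two shapes at different depths.
    static Node sample() {
        return new Node("body",
            new Node("paragraph", new Node("run"), new Node("shape")),
            new Node("table", new Node("paragraph", new Node("shape"))));
    }

    public static void main(String[] args) {
        Node original = sample();
        Node pruned = original.deepCopy();
        pruned.removeAll("shape");
        System.out.println(original.count("shape") + " shapes in original, "
            + pruned.count("shape") + " in pruned copy");
    }
}
```

The trade-off is the cost of the copy itself: for a large document, cloning everything just to drop a few node types may be slower than filtering during a single pass, so it is worth measuring both on real input.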