Iterate through document's nodes and extract data using Java

How do I iterate through a document’s nodes and extract data?

For example, consider a Word document with a single paragraph and an image. As I iterate through the document’s nodes, how do I determine whether a node is a paragraph or an image? How do I get the image data out of the node? How do I tell what style is applied to a node?

Node.getNodeType() seems to always return 8 (Paragraph), and I can’t see any way to determine whether the node is bold, italic, underlined, etc.

Any help much appreciated.

Example:

Document d = new Document("path/to/file.doc");
Node curNode = d.getSections().get(0).getBody().getFirstChild();

while (curNode != null) {
    Node nextNode = curNode.getNextSibling();
    int nodeType = curNode.getNodeType();
    if (nodeType == NodeType.PARAGRAPH) {
        System.out.println(curNode.getText());
        // How do I know if the text is bold/underline/italic? What about lists?
    }
    curNode = nextNode;
}

Hi Alex,

Thanks for your query. The following code snippet should cover your requirement. Please let us know if you have any more queries.

Document doc = new Document("in.doc");

Node[] nodes = doc.getSections().get(0).getBody().getChildNodes().toArray();
int imageIndex = 0;

for (int i = 0; i < nodes.length; i++)
{
    if (nodes[i].getNodeType() != NodeType.PARAGRAPH)
        continue;

    Paragraph para = (Paragraph) nodes[i];

    // Character formatting (bold, italic, underline) is stored on the runs inside the paragraph.
    Node[] runs = para.getChildNodes(NodeType.RUN, true).toArray();
    for (int j = 0; j < runs.length; j++)
    {
        Run run = (Run) runs[j];
        System.out.println(run.getFont().getBold());
        System.out.println(run.getFont().getItalic());
        System.out.println(run.getFont().getUnderline());
    }

    // Images are Shape nodes nested inside the paragraph, not direct children of the body,
    // so they are collected from the paragraph rather than from the body's child list.
    Node[] shapes = para.getChildNodes(NodeType.SHAPE, true).toArray();
    for (int j = 0; j < shapes.length; j++)
    {
        Shape shape = (Shape) shapes[j];
        if (shape.hasImage())
        {
            String imageFileName = java.text.MessageFormat.format(
                "Image.ExportImages.{0} Out{1}", imageIndex,
                FileFormatUtil.imageTypeToExtension(shape.getImageData().getImageType()));
            shape.getImageData().save("d:\\" + imageFileName);
            imageIndex++;
        }
    }
}
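
One detail of the file name pattern above: MessageFormat formats a numeric {0} argument using locale-specific grouping, so an index of 1000 or more can come out as "1,000" in the file name. The following is only a minimal sketch of a variant that avoids this, using nothing but the standard library. The class name ImageFileNames and the helper buildName are made up for this example, and the extensions passed in main are just sample values.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class ImageFileNames {
    // Hypothetical helper: builds an export file name from an image index and an extension.
    // The "{0,number,0}" sub-format prints the index as plain digits without grouping separators.
    static String buildName(int imageIndex, String extension) {
        MessageFormat format =
            new MessageFormat("Image.ExportImages.{0,number,0} Out{1}", Locale.ROOT);
        return format.format(new Object[] { imageIndex, extension });
    }

    public static void main(String[] args) {
        // Sample values only; in the snippet above the extension comes from
        // FileFormatUtil.imageTypeToExtension(...).
        System.out.println(buildName(0, ".jpeg"));
        System.out.println(buildName(1234, ".png"));
    }
}
```

Fixing the locale to Locale.ROOT keeps the generated names the same regardless of the machine's regional settings.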

Hi Tahir,

Thank you for your help :)

How do I determine if a Run has a hyperlink? Or if a Run is a List Item?

Hi Alex,

Thanks for your query. Please use the following code snippet for your requirement.

Document doc = new Document("D:\\in.docx");

NodeCollection nodes = doc.getChildNodes(NodeType.FIELD_START, true);
for (FieldStart start : (Iterable<FieldStart>) nodes)
{
    if (start.getFieldType() == FieldType.FIELD_HYPERLINK)
    {
        // Your code
    }
}
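
As one possible way to fill in the "Your code" part, the sketch below collects the display text and target address of each hyperlink. It assumes an Aspose.Words for Java version that exposes fields through doc.getRange().getFields() and provides the FieldHyperlink class. The class name HyperlinkLister and the helper listHyperlinks are made up for this example.

```java
import com.aspose.words.Document;
import com.aspose.words.Field;
import com.aspose.words.FieldHyperlink;
import com.aspose.words.FieldType;

import java.util.ArrayList;
import java.util.List;

public class HyperlinkLister {
    // Hypothetical helper: returns one "display text -> address" entry per hyperlink field.
    static List<String> listHyperlinks(Document doc) throws Exception {
        List<String> result = new ArrayList<>();
        for (Field field : doc.getRange().getFields()) {
            if (field.getType() == FieldType.FIELD_HYPERLINK) {
                // Assumption: hyperlink fields are represented by FieldHyperlink instances.
                FieldHyperlink link = (FieldHyperlink) field;
                result.add(link.getResult() + " -> " + link.getAddress());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = new Document("D:\\in.docx");
        for (String line : listHyperlinks(doc)) {
            System.out.println(line);
        }
    }
}
```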

Please let us know if you have any more queries.

Hi Tahir,

Again, thank you, but I really need to be able to determine whether a Run is a heading, plain text, a list item, etc. I can’t see any way to get the FontFormat to check for styling.

Hi Alex,

Thanks for your query. Please see the documentation of the Run and Paragraph classes for reference.

A Run does not contain all of this information. For example, to check for list items, the Paragraph class has the method isListItem(). Similarly, a Run does not hold image information.
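
To illustrate the point, here is a minimal sketch of checking heading and list status at the paragraph level. It assumes that headings use the built-in Heading 1 to Heading 9 styles and that the StyleIdentifier constants for them are consecutive. The class name ParagraphKinds and the helper kindOf are made up for this example.

```java
import com.aspose.words.Document;
import com.aspose.words.NodeType;
import com.aspose.words.Paragraph;
import com.aspose.words.StyleIdentifier;

public class ParagraphKinds {
    // Hypothetical helper: classifies a paragraph as a heading, a list item, or plain text.
    static String kindOf(Paragraph para) throws Exception {
        // Assumption: headings use the built-in styles, whose identifiers are consecutive.
        int styleId = para.getParagraphFormat().getStyleIdentifier();
        if (styleId >= StyleIdentifier.HEADING_1 && styleId <= StyleIdentifier.HEADING_9) {
            return "heading";
        }
        if (para.isListItem()) {
            return "list item (level " + para.getListFormat().getListLevelNumber() + ")";
        }
        return "plain text";
    }

    public static void main(String[] args) throws Exception {
        Document doc = new Document("in.doc");
        // Paragraph-level properties are read from the Paragraph, not from its runs.
        for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
            String styleName = para.getParagraphFormat().getStyleName();
            System.out.println(kindOf(para) + " [" + styleName + "]: " + para.getText().trim());
        }
    }
}
```

For a given Run, the same checks can be applied to the paragraph that contains it, while bold, italic and underline still come from the run's own font.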

It would be great if you could share more detailed information about your query. We are always glad to help you.