Traversing .docx word by word

Kusumanchi.Rajesh · August 10, 2015, 4:23am

Hi,
I would like to know how to traverse a .docx file word by word, much as one iterates over an array. I would like to compare each word with a set of words I already have, and then hyperlink the matching words in the document. This needs to be done sequentially.

tahir.manzoor · August 10, 2015, 6:14am

Hi Kusumanchi,

Thanks for your inquiry. Please note that Aspose.Words is quite different from Microsoft Word’s Object Model in that it represents the document as a tree of objects, more like an XML DOM tree. If you have worked with any XML DOM library, you will find Aspose.Words easy to understand and work with. When you load a Word document into Aspose.Words, it builds its DOM, and all document elements and formatting are loaded into memory. Please read the following articles for more information on the DOM:
https://docs.aspose.com/words/java/aspose-words-document-object-model/
https://docs.aspose.com/words/java/logical-levels-of-nodes-in-a-document/

In your case, I suggest saving your document to text format and reading the words using the following code example. Hope this helps you. Please let us know if you have any more queries.

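import com.aspose.words.*;

Document doc = new Document(MyDir + "in.docx");
// Export the whole document as plain text.
String str = doc.toString(SaveFormat.TEXT);
// Split on any run of whitespace so that tabs and paragraph breaks also separate words.
String[] words = str.trim().split("\\s+");
for (String word : words)
{
    System.out.println(word);
}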
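
Once the words are available as plain strings, comparing them with a known set is ordinary Java. The following is only a minimal sketch of that comparison step and does not use Aspose.Words at all; the class name WordMatcher, the method findMatches, and the sample keywords are hypothetical names chosen for illustration. It assumes the keyword set is stored in lower case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class WordMatcher {
    // Returns the words of `text`, in document order, that occur in `keywords`.
    // Assumption: `keywords` holds lower-case entries; matching ignores case.
    static List<String> findMatches(String text, Set<String> keywords) {
        List<String> matches = new ArrayList<>();
        // Split on any run of whitespace, including tabs and line breaks.
        for (String token : text.trim().split("\\s+")) {
            // Strip leading and trailing punctuation such as "," or "." before comparing.
            String word = token.replaceAll("^\\W+|\\W+$", "");
            if (!word.isEmpty() && keywords.contains(word.toLowerCase(Locale.ROOT))) {
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("aspose", "docx"); // hypothetical keyword list
        // prints [DOCX, Aspose]
        System.out.println(findMatches("Load the DOCX file with Aspose, then save.", keywords));
    }
}
```

Note that this only identifies the matching words; it works on the extracted text, so it cannot change the document itself.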

Kusumanchi.Rajesh · August 10, 2015, 7:05am

Hi Tahir,

Thanks for your effort in trying to resolve my query, but sadly the approach you suggested won’t help me much, because I need to make all the changes in the same document. I must keep the formatting of the document intact and also add hyperlinks. So, if there is no way to traverse word by word, is there any function or method that gives the offsets of the words?

Thanks and Regards,
Rajesh

tahir.manzoor · August 11, 2015, 4:00am

Hi Rajesh,

Thanks for your inquiry. Please note that all text of the document is stored in runs of text. In your case, I suggest splitting each word into a separate Run node. Please use the following code example to do so.

Hope this helps you. Please let us know if you have any more queries.

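import com.aspose.words.*;

Document doc = new Document(MyDir + "in.docx");
// Visit every paragraph in the document, including nested ones.
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    SplitRuns(paragraph);
}
doc.save(MyDir + "Out.docx");

// Splits a run at the given character position. The original run keeps the text
// before the position; a clone with the same formatting receives the rest.
private static Run splitRun(Run run, int position) throws Exception
{
    Run afterRun = (Run)run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}

// Splits the runs of a paragraph at spaces so that words end up in separate Run nodes.
// A resulting run may be a single space or keep a trailing space, so trim its text before comparing.
private static void SplitRuns(Paragraph paragraph) throws Exception
{
    // Iterate over a snapshot, because splitting inserts new runs into the paragraph.
    for (Node run : paragraph.getRuns().toArray())
    {
        int position = run.getText().indexOf(' ');
        Run runnode = (Run)run;
        while (position >= 0 && runnode.getText().length() >= position)
        {
            Run newRun = splitRun(runnode, position);
            position = newRun.getText().indexOf(' ');
            if (position == -1)
                break;
            // Move past the space so the next split falls after it.
            position++;
            runnode = newRun;
        }
    }
}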
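
To see which pieces the splitting loop above produces without opening a document, the same index arithmetic can be mimicked on a plain string. This is only an illustrative sketch: RunSplitDemo and splitAtSpaces are hypothetical names, no Aspose.Words API is involved, and each list element merely stands in for the text of one Run.

```java
import java.util.ArrayList;
import java.util.List;

final class RunSplitDemo {
    // Mimics the splitting loop on a plain string; each element stands in for one Run's text.
    static List<String> splitAtSpaces(String text) {
        List<String> pieces = new ArrayList<>();
        String current = text;
        int position = current.indexOf(' ');
        while (position >= 0 && current.length() >= position) {
            // Like splitRun: keep the text before `position`, carry the rest forward.
            pieces.add(current.substring(0, position));
            current = current.substring(position);
            position = current.indexOf(' ');
            if (position == -1) {
                break;
            }
            // Move past the space so the next split falls after it.
            position++;
        }
        pieces.add(current);
        return pieces;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        for (String piece : splitAtSpaces("Hello world foo")) {
            out.append('[').append(piece).append(']');
        }
        // prints [Hello][ ][world ][foo]
        System.out.println(out);
    }
}
```

As the output shows, some pieces are a lone space or carry a trailing space, so trimming each run's text before comparing it with the word list is a sensible precaution.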