Request for Support - Retrieving Word by Word Bounding Information in .docx Files

bmanitn · February 10, 2023, 12:27pm

Dear Aspose Support Team,

I hope this email finds you well. I am writing to request your assistance with a problem I am facing while using the Aspose library.

I am working on a project that involves retrieving bounding box information from a .docx file, specifically word by word information. I am using the Aspose.Words library and have been trying to use the getPageInfo().getBounds() method and other similar methods, but I have noticed that these methods are not available in the latest version of Aspose.Words. I have also tried using the LayoutNode and Rect classes, but they are also not found.

I would like to inform you that I already have a Developer License for Aspose.Words.

I am using Aspose.Words version 22.10 and would greatly appreciate it if you could provide any guidance or examples on how to retrieve the word by word bounding information from a .docx file using this version of the library.

Thank you for your time and consideration. I look forward to your response.

Best regards,
Manikandan B

alexey.noskov · February 10, 2023, 2:17pm

@bmanitn As you may know, MS Word documents are flow documents and do not contain any information about document layout. Consumer applications such as MS Word or OpenOffice build the document layout on the fly. Aspose.Words uses its own layout engine to build the document layout when rendering to PDF or other fixed-page formats and when printing. Aspose.Words also provides the LayoutCollector and LayoutEnumerator classes, which allow you to get layout information from the document.
Your requirements can be achieved using the LayoutCollector and LayoutEnumerator classes. You need to wrap each word in your document in a bookmark and then determine the rectangles of the bookmark start and bookmark end using LayoutCollector and LayoutEnumerator. The union of these rectangles gives you the bounding box of the word as calculated by the Aspose.Words layout engine. For example, see the following code:

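import com.aspose.words.*;
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.regex.Pattern;

Document doc = new Document("C:\\Temp\\in.docx");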

// Regular expression that matches a sequence of one or more word characters.
Pattern wordRegex = Pattern.compile("\\w+");

// Use replace functionality to split runs in the document so they contain only one word.
FindReplaceOptions opt = new FindReplaceOptions();
opt.setUseSubstitutions(true);
doc.getRange().replace(wordRegex, "$0", opt);

// Now wrap each "word" run into a bookmark.
LinkedHashMap<String, Run> wordBookmarks = new LinkedHashMap<String, Run>();
int bkIndex = 0;
Iterable<Run> runs = doc.getChildNodes(NodeType.RUN, true);
for (Run r : runs)
{
    // Skip Runs in headers/footers.
    // The LayoutCollector and LayoutEnumerator classes do not work with nodes in headers/footers.
    if (r.getAncestor(NodeType.HEADER_FOOTER) != null)
        continue;

    // Skip Runs whose text does not match the regular expression (e.g. whitespace).
    if (!wordRegex.matcher(r.getText()).matches())
        continue;

    String bkName = "word_bookmark_" + bkIndex;
    bkIndex++;
    wordBookmarks.put(bkName, r);

    r.getParentNode().insertBefore(new BookmarkStart(doc, bkName), r);
    r.getParentNode().insertAfter(new BookmarkEnd(doc, bkName), r);
}

// Create LayoutCollector and LayoutEnumerator
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

// Print bounding boxes of runs with words.
for (String bkName : wordBookmarks.keySet())
{
    Run r = wordBookmarks.get(bkName);
    Bookmark bk = doc.getRange().getBookmarks().get(bkName);

    enumerator.setCurrent(collector.getEntity(bk.getBookmarkStart()));
    Rectangle2D startRect = enumerator.getRectangle();
    enumerator.setCurrent(collector.getEntity(bk.getBookmarkEnd()));
    Rectangle2D endRect = enumerator.getRectangle();

    // Union of the start and end rectangles is the bounding box of the run.
    Rectangle2D result = startRect.createUnion(endRect);

    System.out.println(r.getText() + " -----  " + result);
}
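
To make the union step easier to follow, here is a minimal standalone sketch that uses only java.awt.geom and does not call Aspose.Words at all. The class name WordBoxSketch, the helper wordBox, and the coordinates are hypothetical and made up for illustration; it also assumes the bookmark start and end markers are reported as zero-width rectangles on the same line, with values in points, which you should verify against your own document.

```java
import java.awt.geom.Rectangle2D;

public class WordBoxSketch {
    // Hypothetical helper: merges the rectangles reported for the bookmark
    // start and bookmark end into one bounding box for the word between them.
    static Rectangle2D wordBox(Rectangle2D startRect, Rectangle2D endRect) {
        return startRect.createUnion(endRect);
    }

    public static void main(String[] args) {
        // Made-up coordinates (assumed to be points); both markers are
        // assumed to be zero-width and to sit on the same text line.
        Rectangle2D start = new Rectangle2D.Double(72.0, 100.0, 0.0, 14.0);
        Rectangle2D end = new Rectangle2D.Double(101.5, 100.0, 0.0, 14.0);

        Rectangle2D box = wordBox(start, end);

        // The union runs from the left edge of the start marker
        // to the right edge of the end marker.
        System.out.println("x=" + box.getX() + " y=" + box.getY()
                + " width=" + box.getWidth() + " height=" + box.getHeight());
    }
}
```

If the two markers ever land on different lines, for example a very long word that wraps, the union would cover both lines, so it may be worth comparing the Y coordinates of the two rectangles before trusting the result.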