How to get the bounding box coordinates of each paragraph and table object in a .docx or .rtf document via python?

pranab · February 8, 2023, 7:19pm

I am going through various documents of mine and would like to obtain the coordinates (bounding box information) of all of the paragraphs in a .rtf file. I think Aspose only works on .docx files, but that should be no issue since MS Word easily converts .rtf to .docx, so let’s just go with .docx files.

I am not too familiar with parsing the XML structure of a .docx Word document, but I know that I can extract each paragraph’s text somewhat easily. I believe I should be able to identify tables as well (though I think getting coordinates for tables might be more complicated).

How can I use Aspose.Words to get the x,y coordinates of the first word and the last word in a paragraph so I can draw bounding boxes around them? The second link below shows an example of the output I would like to produce from the bounding box coordinates of the paragraph and table objects.

If I have to do some post-processing on the coordinates to make sure the bounding boxes surround the entire paragraph (given that paragraphs could be indented at the start), that is fine.

https://github.com/pranabislam/file_host/blob/main/example_page.docx
https://github.com/pranabislam/file_host/blob/main/example_page_with_bounding_boxes.png

I am new to Aspose, but I saw this post and I think this person is trying to do something similar: “How can i get xy coordinates from word file with python?”

Thank you!

alexey.noskov · February 9, 2023, 7:05am

@pranab

Aspose.Words supports a wide range of file formats, including DOCX, RTF, binary DOC and many others. Please see our documentation to learn more:
https://docs.aspose.com/words/python-net/supported-document-formats/

You do not have to be familiar with XML to use Aspose.Words. When loading a document, Aspose.Words reads it into the Document Object Model.

As you may know, MS Word documents are flow documents and do not contain any information about document layout. Consumer applications like MS Word or Open Office build the document layout on the fly. Aspose.Words has its own document layout engine. The facade classes LayoutCollector and LayoutEnumerator allow you to get layout information of document elements.
Unfortunately, full usage of these classes is currently limited in the Python version due to WORDSNET-24828. It is already resolved in the current codebase and the fix will be included in the next version, 23.2, of Aspose.Words for Python. Once it is released, you will be able to use code like this to get bounding boxes of paragraphs and tables in your document:

import aspose.words as aw
import aspose.pydrawing as pydraw

# Open document
doc = aw.Document("C:\\Temp\\in.docx")

# Get all paragraphs in the document and wrap them in bookmarks.
# This allows us to get the bounds of the paragraphs.
paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
para_bookmarks = []
i = 0
for node in paragraphs:
    p = node.as_paragraph()
    # Skip paragraphs in headers/footers (LayoutCollector and LayoutEnumerator do not work with header/footer nodes).
    if p.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None:
        continue

    # Skip paragraphs in tables, since tables are processed separately (as per your requirements).
    if p.get_ancestor(aw.NodeType.TABLE) is not None:
        continue

    bk_name = "tmp_bookmark_" + str(i)
    para_bookmarks.append(bk_name)
    i += 1
    # Create a temporary bookmark that wraps the paragraph.
    bk_start = aw.BookmarkStart(doc, bk_name)
    bk_end = aw.BookmarkEnd(doc, bk_name)
    p.prepend_child(bk_start)
    p.append_child(bk_end)

# Create LayoutCollector and LayoutEnumerator objects to get layout information of nodes.
collector = aw.layout.LayoutCollector(doc)
enumerator = aw.layout.LayoutEnumerator(doc)

# Now we can calculate the bounding box of each bookmarked paragraph.
for bk_name in para_bookmarks:
    bk = doc.range.bookmarks.get_by_name(bk_name)
    # Move LayoutEnumerator to the line where the bookmark start is located.
    enumerator.set_current(collector, bk.bookmark_start)
    while enumerator.type != aw.layout.LayoutEntityType.LINE:
        enumerator.move_parent()
    # Get the rectangle of the first line in the paragraph.
    first_rect = enumerator.rectangle
    # Do the same with the bookmark end.
    enumerator.set_current(collector, bk.bookmark_end)
    while enumerator.type != aw.layout.LayoutEntityType.LINE:
        enumerator.move_parent()
    # Get the rectangle of the last line in the paragraph.
    last_rect = enumerator.rectangle
    # The union of the two rectangles is taken as the bounding box of the paragraph wrapped by the bookmark.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print(f"Paragraph rectangle : x={result_rect.x}, y={result_rect.y}, width={result_rect.width}, height={result_rect.height}")

# Do the same with tables.
tables = doc.get_child_nodes(aw.NodeType.TABLE, True)
for node in tables:
    t = node.as_table()
    # Skip tables in headers/footers (LayoutCollector and LayoutEnumerator do not work with header/footer nodes).
    if t.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None:
        continue

    # Move LayoutEnumerator to the first row.
    enumerator.set_current(collector, t.first_row.first_cell.first_paragraph)
    while enumerator.type != aw.layout.LayoutEntityType.ROW:
        enumerator.move_parent()
    # Get the rectangle of the first row of the table.
    first_rect = enumerator.rectangle
    # Do the same with the last row.
    enumerator.set_current(collector, t.last_row.first_cell.first_paragraph)
    while enumerator.type != aw.layout.LayoutEntityType.ROW:
        enumerator.move_parent()
    # Get the rectangle of the last row in the table.
    last_rect = enumerator.rectangle
    # The union of the two rectangles is the bounding box of the table.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print(f"Table rectangle : x={result_rect.x}, y={result_rect.y}, width={result_rect.width}, height={result_rect.height}")
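
A note on the union step, since the question mentions post-processing for indented paragraphs: a union of only the first and last line rectangles covers the full height of a paragraph, but its width is limited to those two lines. If a middle line is wider (for example, an indented first line and a short last line), the box can come out narrower than the paragraph. The sketch below illustrates this in plain Python and does not call Aspose.Words at all; the Rect class, the bounding_box helper and the line coordinates are made up for illustration, and how to collect every line rectangle from the layout engine is not shown here.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Rect:
    """Hypothetical stand-in for a layout rectangle; (x, y) is the top-left corner."""
    x: float
    y: float
    width: float
    height: float

    def union(self, other: "Rect") -> "Rect":
        # Smallest rectangle that contains both rectangles.
        left = min(self.x, other.x)
        top = min(self.y, other.y)
        right = max(self.x + self.width, other.x + other.width)
        bottom = max(self.y + self.height, other.y + other.height)
        return Rect(left, top, right - left, bottom - top)


def bounding_box(rects):
    """Union of every rectangle in a non-empty sequence."""
    return reduce(Rect.union, rects)


# Made-up line rectangles for a three-line paragraph:
# indented first line, full-width middle line, short last line.
lines = [
    Rect(108.0, 100.0, 396.0, 14.0),
    Rect(72.0, 114.0, 468.0, 14.0),
    Rect(72.0, 128.0, 120.0, 14.0),
]

first_and_last = lines[0].union(lines[-1])
all_lines = bounding_box(lines)

print(first_and_last)  # width is 432.0: the middle line sticks out to the right
print(all_lines)       # width is 468.0: every line is covered
```

Both results share the same x, y and height; only the width differs, which is why unioning over every line is the safer choice when line widths vary.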

alexey.noskov · February 16, 2023, 2:31pm

@pranab We have just released version 23.2 of Aspose.Words for Python, which includes the fix for WORDSNET-24828. You can use the code provided above to get the bounds of paragraphs and tables in your document.
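
To draw the resulting boxes on a rendered page image, as in the example PNG from the question, the coordinates need to be converted to pixels. The sketch below assumes the layout rectangles are expressed in points (1/72 inch), measured from the top-left corner of the page, and that the page image was rendered at a known resolution; the helper names and the 96 DPI default are hypothetical, so please check them against your own rendering settings.

```python
import math

POINTS_PER_INCH = 72.0


def points_to_pixels(value_pt, dpi=96.0):
    """Convert a length in points to pixels at the given resolution."""
    return value_pt * dpi / POINTS_PER_INCH


def rect_to_pixel_box(x, y, width, height, dpi=96.0):
    """Return (left, top, right, bottom) in whole pixels.

    The box is rounded outwards so that it never cuts into the content.
    """
    left = math.floor(points_to_pixels(x, dpi))
    top = math.floor(points_to_pixels(y, dpi))
    right = math.ceil(points_to_pixels(x + width, dpi))
    bottom = math.ceil(points_to_pixels(y + height, dpi))
    return (left, top, right, bottom)


# Made-up rectangle in points, converted for a page rendered at 144 DPI.
box = rect_to_pixel_box(72.0, 100.0, 468.0, 42.0, dpi=144.0)
print(box)
```

The resulting tuple uses the (left, top, right, bottom) order that most imaging libraries expect when drawing a rectangle.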

swelly127 · November 6, 2024, 10:21pm

This doesn’t work; the bounding box still only covers the last line.

alexey.noskov · November 7, 2024, 5:28am

@swelly127 Could you please attach your problematic input document and the code that will allow us to reproduce the problem? We will check the issue and provide you with more information.