Compare words then get coordinates changed in input file

nguyentruc · October 24, 2023, 9:24am

Hi all,
How can I get the line or coordinates of the changes in the source file and target file when comparing Word documents?
Thanks

alexey.noskov · October 24, 2023, 11:36am

@nguyentruc There is no direct way to achieve this. As you may know, MS Word documents are flow documents and do not contain any information about document layout. Consumer applications like MS Word or OpenOffice build the document layout on the fly. Aspose.Words has its own document layout engine. The facade classes LayoutCollector and LayoutEnumerator allow you to get layout information for document elements.
After comparing documents, the changes are marked with revisions in the resulting document, so the task is to determine the coordinates of the nodes with revisions. For example, see the following code:

import datetime

import aspose.words as aw
import aspose.pydrawing as pydraw

original = aw.Document("C:\\Temp\\original.docx")
changed = aw.Document("C:\\Temp\\changed.docx")

# compare documents.
original.compare(changed, "test", datetime.date.today())
tmp_bookmarks = []
run_nodes = original.get_child_nodes(aw.NodeType.RUN, True)
bk_counter = 0
for run_node in run_nodes:
    run = run_node.as_run()
    # LayoutCollector and LayoutEnumerator do not work with nodes in header/footer, so skip them.
    if run.get_ancestor(aw.NodeType.HEADER_FOOTER) is None and (run.is_delete_revision or run.is_insert_revision or run.is_format_revision):
        # Wrap the revised run in a temporary bookmark so it can be located in the layout later.
        bk_name = "_tmp_bk_" + str(bk_counter)
        bk_counter += 1
        run.parent_node.insert_before(aw.BookmarkStart(original, bk_name), run)
        run.parent_node.insert_after(aw.BookmarkEnd(original, bk_name), run)
        tmp_bookmarks.append(bk_name)


# create LayoutCollector and LayoutEnumerator to get coordinates of revisions in the resulting document.
collector = aw.layout.LayoutCollector(original)
enumerator = aw.layout.LayoutEnumerator(original)

for bk_name in tmp_bookmarks:
    bk = original.range.bookmarks.get_by_name(bk_name)
    # Move LayoutEnumerator to the line where bookmark start is located
    enumerator.set_current(collector, bk.bookmark_start)
    first_rect = enumerator.rectangle
    # Do the same with bookmark End
    enumerator.set_current(collector, bk.bookmark_end)
    last_rect = enumerator.rectangle
    # Union of the rectangles is the bounding box of the run wrapped by bookmark.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print(f"Revision page: {enumerator.page_index} rectangle: x={result_rect.x}, y={result_rect.y}, width={result_rect.width}, height={result_rect.height}")
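
To make the union step in the code above concrete, here is a minimal pure-Python sketch of what a bounding-box union computes. The helper `union_rect` and the sample rectangles are hypothetical and only illustrate the arithmetic; they are not part of Aspose.Words, and rectangles are assumed to be plain `(x, y, width, height)` tuples.

```python
def union_rect(a, b):
    """Return the smallest (x, y, width, height) rectangle covering both a and b.

    Hypothetical helper for illustration; rectangles are plain tuples.
    """
    left = min(a[0], b[0])
    top = min(a[1], b[1])
    right = max(a[0] + a[2], b[0] + b[2])
    bottom = max(a[1] + a[3], b[1] + b[3])
    return (left, top, right - left, bottom - top)


# Made-up rectangles standing in for the bookmark start and bookmark end positions.
start_rect = (10, 10, 20, 5)
end_rect = (50, 10, 30, 5)
print(union_rect(start_rect, end_rect))  # (10, 10, 70, 5)
```

If a revised run wraps across several lines, a box built from only the start and end positions may not match the visible text exactly, so treat the result as an approximation.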

nguyentruc · October 25, 2023, 9:32am

Oh, thanks @alexey.noskov. Can you help me save a captured image of the area given by the (x, y, width, height) values above?
I tried saving a cropped image, but it didn't work.
Thanks

alexey.noskov · October 25, 2023, 12:22pm

@nguyentruc You can use the following code to get the image of the specified area on the page:

import aspose.words as aw
from PIL import Image

doc = aw.Document("C:\\Temp\\in.docx")

# Save page as an Image
resolution = 300
imgSaveOptions = aw.saving.ImageSaveOptions(aw.SaveFormat.JPEG)
imgSaveOptions.horizontal_resolution = resolution
imgSaveOptions.vertical_resolution = resolution
imgSaveOptions.page_set = aw.saving.PageSet(0)
doc.save("C:\\Temp\\tmp.jpg", imgSaveOptions)

# Now get the area of the image we need
img = Image.open("C:\\Temp\\tmp.jpg")
# Coordinates returned by Aspose.Words are in points; PIL uses pixels, so the units must be converted.
left = aw.ConvertUtil.point_to_pixel(100, resolution)
top = aw.ConvertUtil.point_to_pixel(100, resolution)
right = aw.ConvertUtil.point_to_pixel(300, resolution)
bottom = aw.ConvertUtil.point_to_pixel(300, resolution)
crop_rectangle = (left, top, right, bottom)
cropped_im = img.crop(crop_rectangle)

cropped_im.save("C:\\Temp\\cropped.jpg")
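
The crop code above uses hard-coded corner values, while the revision rectangles from the earlier post are given as (x, y, width, height) in points. A minimal sketch of the conversion, assuming 72 points per inch and the same resolution used to render the page image; `rect_points_to_crop_box` is a hypothetical helper name, not an Aspose.Words or PIL function:

```python
import math

POINTS_PER_INCH = 72


def rect_points_to_crop_box(x, y, width, height, resolution):
    """Convert an (x, y, width, height) rectangle in points to a
    (left, top, right, bottom) pixel box for a page rendered at `resolution` DPI.

    Hypothetical helper. Edges are rounded outward so the box never cuts
    into the rectangle.
    """
    left = math.floor(x * resolution / POINTS_PER_INCH)
    top = math.floor(y * resolution / POINTS_PER_INCH)
    right = math.ceil((x + width) * resolution / POINTS_PER_INCH)
    bottom = math.ceil((y + height) * resolution / POINTS_PER_INCH)
    return (left, top, right, bottom)


# Made-up rectangle: 1 inch from the left and top, 2 inches wide, 1 inch tall.
print(rect_points_to_crop_box(72, 72, 144, 72, 300))  # (300, 300, 900, 600)
```

The resulting tuple can be passed to `img.crop(...)` in place of `crop_rectangle`. The page image has to be the page that contains the revision; as far as I know, `enumerator.page_index` is 1-based while `PageSet` takes a 0-based page index, so subtract 1 when rendering that page.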