How can I get XY coordinates from a Word file with Python?

mkq10 · January 31, 2022, 5:50am

Hi, I want to get the XY coordinates of all the content/text in a Word document with Python. Is there a code example for this? And if it is not possible in a Word document, can we do this in PDF via Python? Thanks

mkq10 · January 31, 2022, 6:56am

Hi, how can I get the XY coordinates of all the text, or of some paragraphs, with Python? Is there a code example? Also, can we do this in PDF via Python? Thanks

alexey.noskov · January 31, 2022, 8:55am

@mkq10 You can use the aspose.words.layout.LayoutEnumerator class to get layout information for nodes. Please see the example test_layout_enumerator on our GitHub.
Please feel free to ask in case of any issues; we will be glad to help you.
Also, could you please let us know where you learned about the Python version of Aspose.Words?

mkq10 · February 1, 2022, 8:22am

Thank you for the response. I would be very thankful if you could just provide an XY coordinates example.
When I run the test_layout_enumerator example from GitHub, it shows this error:
ModuleNotFoundError: No module named 'aspose.words'; 'aspose' is not a package
Regarding the Python version of Aspose, I just did some research and found this solution working.

alexey.noskov · February 1, 2022, 5:27pm

@mkq10 To run the code examples from our GitHub, please clone the whole repository, change directory to \Aspose.Words-for-Python-via-.NET\Examples\ApiExamples, and then run the following command:

python -m unittest ex_layout.ExLayout.test_layout_enumerator

Also, here is a simplified version of this example:

import aspose.words as aw

class TestLayout:

    @staticmethod
    def traverse_layout_forward(layout_enumerator: aw.layout.LayoutEnumerator, depth: int):
        """Enumerate through layout_enumerator's layout entity collection front-to-back,
        in a depth-first manner, and in the "Visual" order."""

        while True:
            TestLayout.print_current_entity(layout_enumerator, depth)

            if layout_enumerator.move_first_child():
                TestLayout.traverse_layout_forward(layout_enumerator, depth + 1)
                layout_enumerator.move_parent()

            if not layout_enumerator.move_next():
                break

    @staticmethod
    def print_current_entity(layout_enumerator: aw.layout.LayoutEnumerator, indent: int):
        """Print information about layout_enumerator's current entity to the console, indenting the text
        with tab characters based on its depth relative to the root node passed to the LayoutEnumerator constructor.
        The rectangle printed at the end is the area and location that the entity occupies on its page."""

        tabs = "\t" * indent

        if layout_enumerator.kind == "":
            print(f"{tabs}-> Entity type: {layout_enumerator.type}")
        else:
            print(f"{tabs}-> Entity type & kind: {layout_enumerator.type}, {layout_enumerator.kind}")

        # Only spans can contain text.
        if layout_enumerator.type == aw.layout.LayoutEntityType.SPAN:
            print(f"{tabs}   Span contents: \"{layout_enumerator.text}\"")

        le_rect = layout_enumerator.rectangle
        print(f"{tabs}   Rectangle dimensions {le_rect.width}x{le_rect.height}, X={le_rect.x} Y={le_rect.y}")
        print(f"{tabs}   Page {layout_enumerator.page_index}")




doc = aw.Document("C:\\Temp\\in.docx")
layout_enumerator = aw.layout.LayoutEnumerator(doc)
layout_enumerator.move_parent(aw.layout.LayoutEntityType.PAGE)
# We can call this method to make sure that the enumerator will be at the first layout entity.
layout_enumerator.reset()
print("Traversing from first to last, elements between pages separated:")
TestLayout.traverse_layout_forward(layout_enumerator, 1)
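
If the coordinates are needed as data rather than console output, the same depth-first walk can collect one row per span instead of printing. The following is only a minimal sketch: `collect_spans`, `rows_to_csv`, and `FIELDS` are hypothetical names that are not part of Aspose.Words, and the helper assumes the enumerator it receives behaves like the `LayoutEnumerator` used above.

```python
import csv
import io

# Column order for the exported rows (hypothetical, chosen for this sketch).
FIELDS = ["page", "text", "x", "y", "width", "height"]


def collect_spans(enumerator, span_type, rows=None):
    """Walk the layout depth-first and record one row per span entity.

    `enumerator` is expected to behave like aw.layout.LayoutEnumerator and
    `span_type` like aw.layout.LayoutEntityType.SPAN. Both are passed in by
    the caller, so this helper itself does not import aspose.words.
    """
    rows = [] if rows is None else rows
    while True:
        if enumerator.type == span_type:
            rect = enumerator.rectangle
            rows.append({
                "page": enumerator.page_index,
                "text": enumerator.text,
                "x": rect.x,
                "y": rect.y,
                "width": rect.width,
                "height": rect.height,
            })
        # Descend into children, then come back up to the current level.
        if enumerator.move_first_child():
            collect_spans(enumerator, span_type, rows)
            enumerator.move_parent()
        if not enumerator.move_next():
            break
    return rows


def rows_to_csv(rows):
    """Serialize the collected rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With the document code above, this would be called after `layout_enumerator.reset()` as `rows = collect_spans(layout_enumerator, aw.layout.LayoutEntityType.SPAN)`, and the string returned by `rows_to_csv(rows)` can then be written to a text file.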

mkq10 · February 2, 2022, 4:42am

OK, I just cloned it, but it is still not working. The simplified version of the code seems to work, though. Please check the error here.
image.png (19.9 KB)

mkq10 · February 2, 2022, 5:27am

One more thing I would like your guidance on: how can I save the document with these rectangle dimensions, both for the whole document and for some specific text? Thanks

alexey.noskov · February 2, 2022, 8:40am

@mkq10 The error message suggests that the aspose-words package is not installed.
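
A quick way to check what Python actually resolves the name `aspose` to is sketched below. In general, a "'aspose' is not a package" message means Python found a plain module called `aspose` (often a local file named `aspose.py`) rather than an installed package. The helper name `describe_top_level` is hypothetical; it relies only on the standard library.

```python
import importlib.util


def describe_top_level(name):
    """Report whether a top-level import name is found and whether it is a package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "is_package": False, "origin": None}
    return {
        "found": True,
        # Packages (including namespace packages) have search locations; plain modules do not.
        "is_package": spec.submodule_search_locations is not None,
        "origin": spec.origin,
    }


# Run this with the same interpreter and working directory that produce the error.
print(describe_top_level("aspose"))
```

If `found` is true but `is_package` is false, `origin` points at the file that is shadowing the installed package. If `found` is false, the package is probably not installed for this interpreter.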

Could you please elaborate on your question a bit more? What is your ultimate goal? It would be great if you could attach your input and the expected output documents (you can create the latter manually). We will check and provide you with more information.

mkq10 · February 2, 2022, 10:15am

But it is installed; I am getting results from the code above with it. Whenever I try to run the examples from GitHub, though, it shows this error.
Here is an example image. I want to save the document like this, with the rectangle dimensions we got from the code above.
image.png (1.8 KB)

mkq10 · February 2, 2022, 10:16am

Also, I am not able to get coordinates for table text. Can I get those too?

alexey.noskov · February 2, 2022, 1:39pm

@mkq10 It is still not quite clear how the coordinates should be exported into the output document. In your case, you can probably just convert your document to one of the supported fixed-page formats (PDF, XPS, HtmlFixed). For example, if you convert your document to HtmlFixed format, all content in the output document will have an absolute position specified:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
doc.save("C:\\Temp\\out.html", aw.SaveFormat.HTML_FIXED)

Here is an example of the output:

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta name="generator" content="Aspose.Words for Python via .NET 21.11.0" />
    <title></title>
    <link rel="stylesheet" type="text/css" href="out/styles.css" media="all" />
</head>
<body>
    <div class="awdiv awpage" style="width:612pt; height:792pt;">
        <div class="awdiv" style="left:72pt; top:72pt;">
            <span class="awspan awtext001" style="font-size:11pt; left:0pt; top:0pt;">Hello, World!!!</span>
        </div>
    </div>
</body>
</html>

mkq10 · February 2, 2022, 6:57pm

Thank you for bearing with me. And no, I don't want to export the coordinates into the output document. I want to draw rectangles over all of the coordinates we get, so that I can check in the output document which text has coordinates and which does not. In other words, the output document will have rectangles over the text.

alexey.noskov · February 3, 2022, 12:50pm

@mkq10 Thank you for the additional information. I think in this case the simplest way to achieve what you need is to convert the document to a raster image and then draw the rectangles returned by LayoutEnumerator on this image. For example, see the following code:

import aspose.words as aw
from PIL import Image, ImageDraw

class TestLayout:

    @staticmethod
    def traverse_layout_forward(layout_enumerator: aw.layout.LayoutEnumerator, draw: ImageDraw.ImageDraw):
        """Enumerate through layout_enumerator's layout entity collection front-to-back,
        in a depth-first manner, and in the "Visual" order."""

        while True:
            TestLayout.outline_current_entity(layout_enumerator, draw)

            if layout_enumerator.move_first_child():
                TestLayout.traverse_layout_forward(layout_enumerator, draw)
                layout_enumerator.move_parent()

            if not layout_enumerator.move_next():
                break

    @staticmethod
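    def outline_current_entity(layout_enumerator: aw.layout.LayoutEnumerator, draw: ImageDraw.ImageDraw):
        """Draw a red outline around the current entity if it is a text span."""

        # Saving to JPEG renders only the first page by default, so skip spans on later pages.
        if (layout_enumerator.type == aw.layout.LayoutEntityType.SPAN
                and layout_enumerator.page_index == 1):
            le_rect = layout_enumerator.rectangle
            # Layout coordinates are in points; the image is measured in pixels.
            x = aw.ConvertUtil.point_to_pixel(le_rect.x)
            y = aw.ConvertUtil.point_to_pixel(le_rect.y)
            x1 = aw.ConvertUtil.point_to_pixel(le_rect.x + le_rect.width)
            y1 = aw.ConvertUtil.point_to_pixel(le_rect.y + le_rect.height)
            draw.rectangle([x, y, x1, y1], outline=(200, 0, 0))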


lic = aw.License()
lic.set_license("X:\\awnet\\TestData\\Licenses\\Aspose.Words.Python.NET.lic")

doc = aw.Document("C:\\Temp\\in.docx")
# Save document as an image
doc.save("C:\\Temp\\out.jpeg")

layout_enumerator = aw.layout.LayoutEnumerator(doc)
layout_enumerator.move_parent(aw.layout.LayoutEntityType.PAGE)
# We can call this method to make sure that the enumerator will be at the first layout entity.
layout_enumerator.reset()

with Image.open("C:\\Temp\\out.jpeg") as im:

    draw = ImageDraw.Draw(im)
    TestLayout.traverse_layout_forward(layout_enumerator, draw)
    im.save("C:\\Temp\\out_modified.jpeg", "JPEG")

Here are the input document and the output images produced by this code: in.docx (12.3 KB)
out.jpeg (16.9 KB)
out_modified.jpeg (15.3 KB)

I used Pillow to edit the image.
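
The conversion step matters because the layout rectangles are measured in points (1/72 inch) while Pillow draws in pixels. The arithmetic behind it can be sketched in plain Python as below. The function names are hypothetical, and the default of 96 dpi is an assumption that has to match the resolution the image was actually rendered at.

```python
def point_to_pixel(points, dpi=96.0):
    """Convert a length in points (1/72 inch) to pixels at the given resolution."""
    return points * dpi / 72.0


def rect_to_pixel_box(x, y, width, height, dpi=96.0):
    """Convert a rectangle in points to the (x0, y0, x1, y1) pixel box that Pillow expects."""
    return (
        point_to_pixel(x, dpi),
        point_to_pixel(y, dpi),
        point_to_pixel(x + width, dpi),
        point_to_pixel(y + height, dpi),
    )


# A 36x18 pt span located one inch from the top-left corner of the page.
print(rect_to_pixel_box(72, 72, 36, 18))  # (96.0, 96.0, 144.0, 120.0)
```

If the image is saved at a different resolution, the same `dpi` value has to be used in the conversion, otherwise the outlines will be offset from the text.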

mkq10 · February 4, 2022, 2:25pm

Thank you for this additional information. But I am trying to keep the OCR version, so the image approach will not help in this case. I appreciate what you did for me, though.
Also, can you explain how you are getting coordinates from table content/text?

alexey.noskov · February 5, 2022, 7:46am

@mkq10 Here is the modified code that adds red boxes around the text in an MS Word document. The code creates floating rectangle shapes with a transparent background and adds them to the document:

import aspose.words as aw
import aspose.pydrawing as drawing

class TestLayout:

    @staticmethod
    def traverse_layout_forward(layout_enumerator: aw.layout.LayoutEnumerator, doc : aw.Document):
        """Enumerate through layout_enumerator's layout entity collection front-to-back,
        in a depth-first manner, and in the "Visual" order."""

        while True:
            TestLayout.outline_current_entity(layout_enumerator, doc)

            if layout_enumerator.move_first_child():
                TestLayout.traverse_layout_forward(layout_enumerator, doc)
                layout_enumerator.move_parent()

            if not layout_enumerator.move_next():
                break

    @staticmethod
    def outline_current_entity(layout_enumerator: aw.layout.LayoutEnumerator, doc: aw.Document):
        """Add a red, unfilled rectangle shape over the current entity if it is a text span."""

        if layout_enumerator.type == aw.layout.LayoutEntityType.SPAN:
            le_rect = layout_enumerator.rectangle

            rect = aw.drawing.Shape(doc, aw.drawing.ShapeType.RECTANGLE)
            rect.relative_horizontal_position = aw.drawing.RelativeHorizontalPosition.PAGE
            rect.relative_vertical_position = aw.drawing.RelativeVerticalPosition.PAGE
            rect.wrap_type = aw.drawing.WrapType.NONE
            rect.left = le_rect.x
            rect.top = le_rect.y
            rect.width = le_rect.width
            rect.height = le_rect.height
            rect.fill.opacity = 0
            rect.stroke.color = drawing.Color.red

            doc.first_section.body.first_paragraph.append_child(rect)


lic = aw.License()
lic.set_license("X:\\awnet\\TestData\\Licenses\\Aspose.Words.Python.NET.lic")

doc = aw.Document("C:\\Temp\\in.docx")

layout_enumerator = aw.layout.LayoutEnumerator(doc)
layout_enumerator.move_parent(aw.layout.LayoutEntityType.PAGE)
# We can call this method to make sure that the enumerator will be at the first layout entity.
layout_enumerator.reset()

TestLayout.traverse_layout_forward(layout_enumerator, doc)

# Save the modified document.
doc.save("C:\\Temp\\out_modified.docx")

Here are the input and output documents: in.docx (12.3 KB)
out_modified.docx (10.4 KB)

I used LayoutEnumerator to traverse the layout entities, and it returns the coordinates of table content. If this does not work in your case, there may be something wrong with your document. Could you please attach your document here for testing? We will check it and provide you with more information.

mkq10 · February 9, 2022, 8:24am

Does the set_license code matter? I am not adding the license code (I am using the trial). And yes, my code is the same as yours, and I tested it with your example document; it returns the table content as well. Maybe there is something wrong with my document.

alexey.noskov · February 9, 2022, 9:27am

@mkq10 Yes, the set_license code matters. If you do not set a license, Aspose.Words works in evaluation mode and limits the maximum number of paragraphs in the document, so your document might be truncated at the end. If you would like to test Aspose.Words without the evaluation version limitations, you can request a temporary 30-day license.
Could you please also attach your document here for testing? I will check the scenario on my side and provide you more information.
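
To make it obvious whether a script is running in evaluation mode, the license call can be wrapped in a small check. This is only a sketch: `apply_license_if_present` is a hypothetical helper, and the only Aspose.Words call it uses is the `aw.License().set_license(...)` call already shown in the examples above.

```python
from pathlib import Path


def apply_license_if_present(license_path):
    """Apply an Aspose.Words license file if it exists; return True when it was applied."""
    path = Path(license_path)
    if not path.is_file():
        # No license file: Aspose.Words keeps running in evaluation mode.
        return False
    # Imported lazily so that the file check above works even without the package.
    import aspose.words as aw
    aw.License().set_license(str(path))
    return True
```

A script can then print a warning when the function returns `False`, so that truncated output is not mistaken for a layout problem.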

mkq10 · February 9, 2022, 11:13am

Thank you very much. Please check this document and test. I am not getting all the content from here. Please print the results in a text file for me if you get all of the content. Thanks
Aspose Check.docx (25.6 KB)

alexey.noskov · February 9, 2022, 1:35pm

@mkq10 Thank you for the additional information. I have managed to reproduce the problem and logged it as WORDSNET-23444 so that it can be addressed. The problem occurs because the tables in your document are inside text box shapes, and LayoutEnumerator does not visit content inside text box shapes. We will investigate the problem and provide you with more information.

mkq10 · February 9, 2022, 2:14pm

Thank you very much. I am waiting for the solution. I tried copying a table from this document into another one to test, and that worked. But my documents are of this kind (tables inside text box shapes), so I am waiting for a solution from you.