Hi, how can I get the XY coordinates of all the content/text, or of some paragraphs, with Python? Is there a code example for this? And if it is not possible in a Word document, can we do this in PDF via Python? Thanks
@mkq10 You can use the aspose.words.layout.LayoutEnumerator class to get layout information of nodes. Please see the test_layout_enumerator example on our GitHub.
Please feel free to ask in case of any issues. We will be glad to help you.
Also, could you please let us know where you learned about the Python version of Aspose.Words?
Thank you for the response. I would be very thankful if you could provide me with an XY coordinates example.
When I run the test_layout_enumerator example from GitHub, it shows this error:
ModuleNotFoundError: No module named 'aspose.words'; 'aspose' is not a package
Regarding the Python version of Aspose, I just did some research and found this solution working.
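A note on the error above: the trailing "'aspose' is not a package" part generally means Python resolved the name aspose to a plain module rather than a package, for example a local file called aspose.py sitting in the working directory. Whether that is the cause here is an assumption; the thread does not confirm it. A minimal diagnostic sketch using only the standard library (the helper name describe_top_level is made up for illustration):

```python
import importlib.util


def describe_top_level(name: str) -> str:
    """Report whether a top-level name resolves to a package, a plain module, or nothing."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found"
    # Packages expose submodule_search_locations; plain modules leave it as None.
    if spec.submodule_search_locations is None:
        return f"{name}: plain module at {spec.origin} (not a package)"
    return f"{name}: package at {list(spec.submodule_search_locations)}"


# Run this from the same directory where the failing command was run.
print(describe_top_level("aspose"))
```

If this reports a plain module, the path it prints is the file that shadows the installed package.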
@mkq10 To run the code examples from our GitHub, please clone the whole repository, change directory to \Aspose.Words-for-Python-via-.NET\Examples\ApiExamples and then run the following command:
python -m unittest ex_layout.ExLayout.test_layout_enumerator
Also, here is a simplified version of this example:
import aspose.words as aw

class TestLayout:

    @staticmethod
    def traverse_layout_forward(layout_enumerator: aw.layout.LayoutEnumerator, depth: int):
        """Enumerate through layout_enumerator's layout entity collection front-to-back,
        in a depth-first manner, and in the "Visual" order."""
        while True:
            TestLayout.print_current_entity(layout_enumerator, depth)

            if layout_enumerator.move_first_child():
                TestLayout.traverse_layout_forward(layout_enumerator, depth + 1)
                layout_enumerator.move_parent()

            if not layout_enumerator.move_next():
                break

    @staticmethod
    def print_current_entity(layout_enumerator: aw.layout.LayoutEnumerator, indent: int):
        """Print information about layout_enumerator's current entity to the console, while indenting the text with tab characters
        based on its depth relative to the root node that we provided in the constructor LayoutEnumerator instance.
        The rectangle that we process at the end represents the area and location that the entity takes up in the document."""
        tabs = "\t" * indent

        if layout_enumerator.kind == "":
            print(f"{tabs}-> Entity type: {layout_enumerator.type}")
        else:
            print(f"{tabs}-> Entity type & kind: {layout_enumerator.type}, {layout_enumerator.kind}")

        # Only spans can contain text.
        if layout_enumerator.type == aw.layout.LayoutEntityType.SPAN:
            print(f"{tabs} Span contents: \"{layout_enumerator.text}\"")

        le_rect = layout_enumerator.rectangle
        print(f"{tabs} Rectangle dimensions {le_rect.width}x{le_rect.height}, X={le_rect.x} Y={le_rect.y}")
        print(f"{tabs} Page {layout_enumerator.page_index}")

doc = aw.Document("C:\\Temp\\in.docx")
layout_enumerator = aw.layout.LayoutEnumerator(doc)
layout_enumerator.move_parent(aw.layout.LayoutEntityType.PAGE)
# We can call this method to make sure that the enumerator will be at the first layout entity.
layout_enumerator.reset()

print("Traversing from first to last, elements between pages separated:")
TestLayout.traverse_layout_forward(layout_enumerator, 1)
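To see why the traversal calls move_parent() after each recursive call, here is a small sketch of the same depth-first pattern that runs without Aspose.Words. The Node and Cursor classes are hypothetical stand-ins written for illustration. They only mimic the move_first_child / move_next / move_parent behaviour assumed from the example above and are not part of any library:

```python
class Node:
    """A tiny tree node standing in for a layout entity (illustrative only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


class Cursor:
    """Hypothetical cursor that mimics the enumerator's three move methods."""

    def __init__(self, node):
        self.current = node

    def move_first_child(self):
        if not self.current.children:
            return False
        self.current = self.current.children[0]
        return True

    def move_next(self):
        parent = self.current.parent
        if parent is None:
            return False
        position = parent.children.index(self.current)
        if position + 1 >= len(parent.children):
            return False
        self.current = parent.children[position + 1]
        return True

    def move_parent(self):
        if self.current.parent is None:
            return False
        self.current = self.current.parent
        return True


def traverse(cursor, depth, visited):
    """Same loop shape as traverse_layout_forward: visit, descend, climb back, advance."""
    while True:
        visited.append((depth, cursor.current.label))
        if cursor.move_first_child():
            traverse(cursor, depth + 1, visited)
            # The recursion leaves the cursor on the last child, so climb back up.
            cursor.move_parent()
        if not cursor.move_next():
            break


page = Node("page", [
    Node("line 1", [Node("span a"), Node("span b")]),
    Node("line 2", [Node("span c")]),
])
visited = []
traverse(Cursor(page), 0, visited)
for depth, label in visited:
    print("\t" * depth + label)
```

Without the move_parent() call, the following move_next() would look for a sibling of the last child instead of a sibling of the entity whose children were just visited, so "line 2" would never be reached.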
OK, I just cloned it, but it is still not working. The simplified version of the code seems to work, though. Please check the error here:
image.png (19.9 KB)
One more thing I would like your guidance on: how can I save the document with these rectangle dimensions, both for the whole document and for some specific text? Thanks
@mkq10 The error message suggests that the aspose-words package is not installed.
Could you please elaborate on your question a bit more? What is your ultimate goal? It would be great if you attached your input and the expected output documents (you can create the latter manually). We will check and provide you more information.
But it is installed. I am getting results from the code above with it, but whenever I try to run the examples from GitHub it shows this error.
Here is an example image. I want to save the document like this, with the rectangle dimensions we got with the help of the code given above.
image.png (1.8 KB)
Also, I am not able to get coordinates for table text. Can I get those too?
@mkq10 It is still not quite clear how the coordinates must be exported into the output document. In your case you can probably just convert your document into one of the supported fixed-page formats (PDF, XPS, HtmlFixed). For example, if you convert your document to the HtmlFixed format, all content in the output document will have an absolute position specified:
import aspose.words as aw
doc = aw.Document("C:\\Temp\\in.docx")
doc.save("C:\\Temp\\out.html", aw.SaveFormat.HTML_FIXED)
Here is an example of the output:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Aspose.Words for Python via .NET 21.11.0" />
<title></title>
<link rel="stylesheet" type="text/css" href="out/styles.css" media="all" />
</head>
<body>
<div class="awdiv awpage" style="width:612pt; height:792pt;">
<div class="awdiv" style="left:72pt; top:72pt;">
<span class="awspan awtext001" style="font-size:11pt; left:0pt; top:0pt;">Hello, World!!!</span>
</div>
</div>
</body>
</html>
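If the goal is to read the positions back out of such a file, the following sketch parses the HtmlFixed markup with Python's standard html.parser module. It assumes, based only on the sample above, that each div and span carries left/top offsets in points relative to its parent element, so an absolute position is the sum of the offsets of all open ancestors. Real output may use other units or attributes, so treat this as a starting point. The class name FixedHtmlSpans is made up for illustration:

```python
import re
from html.parser import HTMLParser

# Matches "left:72pt" or "top:0pt" as standalone CSS properties in a style attribute.
_OFFSET = re.compile(r"(?:^|;)\s*(left|top)\s*:\s*(-?\d+(?:\.\d+)?)pt")


class FixedHtmlSpans(HTMLParser):
    """Collect (text, x, y) for every non-empty span, in points (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._stack = []  # (tag, left, top) for each currently open div/span
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("div", "span"):
            return
        style = dict(attrs).get("style") or ""
        offsets = {name: float(value) for name, value in _OFFSET.findall(style)}
        self._stack.append((tag, offsets.get("left", 0.0), offsets.get("top", 0.0)))

    def handle_endtag(self, tag):
        if tag in ("div", "span") and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only text directly inside a span is recorded.
        if self._stack and self._stack[-1][0] == "span" and data.strip():
            x = sum(entry[1] for entry in self._stack)
            y = sum(entry[2] for entry in self._stack)
            self.spans.append((data, x, y))


SAMPLE = """
<div class="awdiv awpage" style="width:612pt; height:792pt;">
<div class="awdiv" style="left:72pt; top:72pt;">
<span class="awspan awtext001" style="font-size:11pt; left:0pt; top:0pt;">Hello, World!!!</span>
</div>
</div>
"""

parser = FixedHtmlSpans()
parser.feed(SAMPLE)
print(parser.spans)  # [('Hello, World!!!', 72.0, 72.0)]
```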
Thank you for bearing with me. And no, I don't want to export coordinates into the output document. I want to draw rectangles over all of the coordinates we get, so that I can check in the output document which text has coordinates and which does not. In other words, the output document will have rectangles over the text.
@mkq10 Thank you for the additional information. I think in this case the simplest way to achieve what you need is to convert the document to a raster image and then draw the rectangles returned by LayoutEnumerator on this image. For example, see the following code:
import aspose.words as aw
from PIL import Image, ImageDraw

class TestLayout:

    @staticmethod
    def traverse_layout_forward(layout_enumerator: aw.layout.LayoutEnumerator, draw: ImageDraw.ImageDraw):
        """Enumerate through layout_enumerator's layout entity collection front-to-back,
        in a depth-first manner, and in the "Visual" order."""
        while True:
            TestLayout.outline_current_entity(layout_enumerator, draw)

            if layout_enumerator.move_first_child():
                TestLayout.traverse_layout_forward(layout_enumerator, draw)
                layout_enumerator.move_parent()

            if not layout_enumerator.move_next():
                break

    @staticmethod
    def outline_current_entity(layout_enumerator: aw.layout.LayoutEnumerator, draw: ImageDraw.ImageDraw):
        """Draw an outline around the current entity if it is a span (a run of text)."""
        if layout_enumerator.type == aw.layout.LayoutEntityType.SPAN:
            le_rect = layout_enumerator.rectangle
            # The layout rectangle is in points; the image is addressed in pixels.
            x = aw.ConvertUtil.point_to_pixel(le_rect.x)
            y = aw.ConvertUtil.point_to_pixel(le_rect.y)
            x1 = aw.ConvertUtil.point_to_pixel(le_rect.x + le_rect.width)
            y1 = aw.ConvertUtil.point_to_pixel(le_rect.y + le_rect.height)
            draw.rectangle([x, y, x1, y1], outline=200)

lic = aw.License()
lic.set_license("X:\\awnet\\TestData\\Licenses\\Aspose.Words.Python.NET.lic")

doc = aw.Document("C:\\Temp\\in.docx")
# Save document as an image
doc.save("C:\\Temp\\out.jpeg")

layout_enumerator = aw.layout.LayoutEnumerator(doc)
layout_enumerator.move_parent(aw.layout.LayoutEntityType.PAGE)
# We can call this method to make sure that the enumerator will be at the first layout entity.
layout_enumerator.reset()

with Image.open("C:\\Temp\\out.jpeg") as im:
    draw = ImageDraw.Draw(im)
    TestLayout.traverse_layout_forward(layout_enumerator, draw)
    im.save("C:\\Temp\\out_modified.jpeg", "JPEG")
Here are the input document and the output images produced by this code: in.docx (12.3 KB)
out.jpeg (16.9 KB)
out_modified.jpeg (15.3 KB)
I used Pillow to edit the image.
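The image code above converts each rectangle from points to pixels before drawing. As a rough illustration of that arithmetic, here is a plain-Python sketch. It assumes 72 points per inch and a 96 dpi target, which is what ConvertUtil.point_to_pixel is documented to use when called with a single argument. If the image is rendered at another resolution, the dpi value has to match it. The helper names are made up for illustration:

```python
POINTS_PER_INCH = 72.0


def points_to_pixels(points: float, dpi: float = 96.0) -> float:
    """Convert a length in points to pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


def rect_to_pixel_box(x: float, y: float, width: float, height: float, dpi: float = 96.0):
    """Turn an (x, y, width, height) rectangle in points into (left, top, right, bottom) pixels."""
    return tuple(points_to_pixels(v, dpi) for v in (x, y, x + width, y + height))


print(points_to_pixels(72))   # 96.0: one inch at 96 dpi
print(points_to_pixels(612))  # 816.0: width of a US Letter page at 96 dpi
print(rect_to_pixel_box(72, 72, 100, 14))
```

The four-value box is the form that Pillow's draw.rectangle accepts in the example above.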
Thank you for this additional information. But I am trying to keep the OCR version, so the image approach will not help in this case. I appreciate what you did for me, though.
Also, can you explain how you are getting coordinates from table content/text?
@mkq10 Here is the modified code that adds red boxes around the text in an MS Word document. For each span, the code creates a floating rectangle shape with a transparent background and adds it to the document:
import aspose.words as aw
import aspose.pydrawing as drawing

class TestLayout:

    @staticmethod
    def traverse_layout_forward(layout_enumerator: aw.layout.LayoutEnumerator, doc: aw.Document):
        """Enumerate through layout_enumerator's layout entity collection front-to-back,
        in a depth-first manner, and in the "Visual" order."""
        while True:
            TestLayout.outline_current_entity(layout_enumerator, doc)

            if layout_enumerator.move_first_child():
                TestLayout.traverse_layout_forward(layout_enumerator, doc)
                layout_enumerator.move_parent()

            if not layout_enumerator.move_next():
                break

    @staticmethod
    def outline_current_entity(layout_enumerator: aw.layout.LayoutEnumerator, doc: aw.Document):
        """Add a floating red rectangle over the current entity if it is a span (a run of text)."""
        if layout_enumerator.type == aw.layout.LayoutEntityType.SPAN:
            le_rect = layout_enumerator.rectangle
            rect = aw.drawing.Shape(doc, aw.drawing.ShapeType.RECTANGLE)
            # Position the shape relative to the page so it matches the layout coordinates.
            rect.relative_horizontal_position = aw.drawing.RelativeHorizontalPosition.PAGE
            rect.relative_vertical_position = aw.drawing.RelativeVerticalPosition.PAGE
            rect.wrap_type = aw.drawing.WrapType.NONE
            rect.left = le_rect.x
            rect.top = le_rect.y
            rect.width = le_rect.width
            rect.height = le_rect.height
            rect.fill.opacity = 0
            rect.stroke.color = drawing.Color.red
            doc.first_section.body.first_paragraph.append_child(rect)

lic = aw.License()
lic.set_license("X:\\awnet\\TestData\\Licenses\\Aspose.Words.Python.NET.lic")

doc = aw.Document("C:\\Temp\\in.docx")

layout_enumerator = aw.layout.LayoutEnumerator(doc)
layout_enumerator.move_parent(aw.layout.LayoutEntityType.PAGE)
# We can call this method to make sure that the enumerator will be at the first layout entity.
layout_enumerator.reset()

TestLayout.traverse_layout_forward(layout_enumerator, doc)

# Save the document with the added rectangles.
doc.save("C:\\Temp\\out_modified.docx")
Here are the input and output documents: in.docx (12.3 KB)
out_modified.docx (10.4 KB)
I used LayoutEnumerator to traverse the layout entities, and it returns the coordinates of table content as well. If this does not work in your case, there may be something wrong with your document. Could you please attach your document here for testing? We will check it and provide you more information.
Does the set_license code matter? I am not adding the license code (I am using the trial). And yes, my code is the same as yours, and I tested your example document file. It returns the content of the table as well. Maybe there is something wrong with my document.
@mkq10 Yes, the set_license code matters. If you do not set a license, Aspose.Words works in evaluation mode and limits the maximum number of paragraphs in the document, so your document might be truncated at the end. If you would like to test Aspose.Words without the evaluation version limitations, you can request a temporary 30-day license.
Could you please also attach your document here for testing? I will check the scenario on my side and provide you more information.
Thank you very much. Please check and test this document. I am not getting all the content from it. If you do get all of the content, please print the results to a text file for me. Thanks
Aspose Check.docx (25.6 KB)
@mkq10 Thank you for the additional information. I have managed to reproduce the problem and logged it as WORDSNET-23444 so that it can be corrected. The problem occurs because the tables in your document are inside text box shapes, and LayoutEnumerator does not visit content inside text box shapes. We will investigate the problem and provide you more information.
Thank you very much. I am waiting for the solution. I tried copying a table from this document into another one to test, and that worked. But the documents I have are of this kind (with content inside text box shapes), so I am waiting for the solution from you.