Hi Team,
I am using the Aspose.Words and Aspose.Cells Python libraries to read the data.
I want to get the text fragments (getTextFragments) from these files, which I will later use for annotation.
How can I get the text fragments?
Thanks,
Kathiresh
@kathirmsc85 Could you please attach your sample document here for our reference? We will check it and provide you with more information.
@kathirmsc85,
Please also share a sample MS Excel file and provide more information on which text fragments you want to get. We will review and assist you accordingly.
Hi @amjad.sahi, @alexey.noskov,
Sorry for the late response.
Please find the attached documents.
Thanks,
Kathiresh Muthusamy
aspose_sample.docx (14.0 KB)
sample_aspose.zip (24.6 KB)
@kathirmsc85 Thank you for the additional information. As far as I can see, there are no frames in your document, only paragraphs. You can loop through the paragraphs in your document using the following code:
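import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
# Get all paragraphs in the document, including nested ones.
paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
for node in paragraphs:
    para = node.as_paragraph()
    # Print the paragraph text without the trailing paragraph break.
    print(para.to_string(aw.SaveFormat.TEXT).strip())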
@kathirmsc85,
Thanks for the sample Excel files.
For your reference, the following sample code shows how to get the contents of the cells in a worksheet of an Excel workbook using Aspose.Cells for Python via Java.
Sample code:
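# Start the JVM before importing the API (required by Aspose.Cells for Python via Java).
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook

# Load the MS Excel file.
workbook = Workbook("sample_aspose.xlsx")
worksheet = workbook.getWorksheets().get(0)
cells = worksheet.getCells()
# Print the name and text of each cell.
for cell in cells:
    print(cell.getName() + ": " + cell.getStringValue())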
Hope this helps a bit.
Hi @amjad.sahi, @alexey.noskov,
My request is not about reading the data from Word or Excel.
For my POC, I need the x-axis and y-axis coordinates of the text fragments, as shown below. I am able to read them from PDF documents, but how can I get them with Aspose.Words and Aspose.Cells?
sample:
textFragment.getRectangle().getLLX()
textFragment.getRectangle().getLLY()
textFragment.getRectangle().getURX()
textFragment.getRectangle().getURY()
Using these extracted details, I want to annotate words or sentences in a Word or Excel document.
Thanks,
Kathiresh Muthusamy
@kathirmsc85 As you may know, MS Word documents are flow documents and do not contain any information about document layout, so there is no concept of a Text Fragment. Consumer applications like MS Word or Open Office build the document layout on the fly. Aspose.Words has its own document layout engine. The facade classes LayoutCollector and LayoutEnumerator allow you to get layout information for document elements. For example, the following code calculates the bounding boxes of paragraphs and tables in the document:
import aspose.words as aw
import aspose.pydrawing as pydraw

# Open document
doc = aw.Document("C:\\Temp\\in.docx")

# Get all paragraphs in the document and wrap them into bookmarks.
# This will allow us to get the bounds of the paragraphs.
paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
para_bookmarks = []
i = 0
for node in paragraphs:
    p = node.as_paragraph()

    # Skip paragraphs which are in header/footer (LayoutCollector and LayoutEnumerator classes do not work with header/footer nodes)
    if p.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None:
        continue

    # Skip paragraphs in tables since tables will be processed separately (due to your requirements)
    if p.get_ancestor(aw.NodeType.TABLE) is not None:
        continue

    bk_name = "tmp_bookmark_" + str(i)
    para_bookmarks.append(bk_name)
    i += 1

    # Create a temporary bookmark that wraps the paragraph
    bk_start = aw.BookmarkStart(doc, bk_name)
    bk_end = aw.BookmarkEnd(doc, bk_name)
    p.prepend_child(bk_start)
    p.append_child(bk_end)

# Create LayoutCollector and LayoutEnumerator classes to get layout information of nodes.
collector = aw.layout.LayoutCollector(doc)
enumerator = aw.layout.LayoutEnumerator(doc)

# Now we can calculate the bounding box of each bookmarked paragraph.
for bk_name in para_bookmarks:
    bk = doc.range.bookmarks.get_by_name(bk_name)

    # Move LayoutEnumerator to the line where the bookmark start is located
    enumerator.set_current(collector, bk.bookmark_start)
    while enumerator.type != aw.layout.LayoutEntityType.LINE:
        enumerator.move_parent()
    # Get rectangle of the first line in the paragraph.
    first_rect = enumerator.rectangle

    # Do the same with the bookmark end
    enumerator.set_current(collector, bk.bookmark_end)
    while enumerator.type != aw.layout.LayoutEntityType.LINE:
        enumerator.move_parent()
    # Get rectangle of the last line in the paragraph.
    last_rect = enumerator.rectangle

    # Union of the rectangles is the bounding box of the paragraph wrapped by the bookmark.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print(f"Paragraph rectangle : x={result_rect.x}, y={result_rect.y}, width={result_rect.width}, height={result_rect.height}")

# Do the same with tables
tables = doc.get_child_nodes(aw.NodeType.TABLE, True)
for node in tables:
    t = node.as_table()

    # Skip tables which are in header/footer (LayoutCollector and LayoutEnumerator classes do not work with header/footer nodes)
    if t.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None:
        continue

    # Move LayoutEnumerator to the first row
    enumerator.set_current(collector, t.first_row.first_cell.first_paragraph)
    while enumerator.type != aw.layout.LayoutEntityType.ROW:
        enumerator.move_parent()
    # Get rectangle of the first row of the table.
    first_rect = enumerator.rectangle

    # Do the same with the last row
    enumerator.set_current(collector, t.last_row.first_cell.first_paragraph)
    while enumerator.type != aw.layout.LayoutEntityType.ROW:
        enumerator.move_parent()
    # Get rectangle of the last row in the table.
    last_rect = enumerator.rectangle

    # Union of the rectangles is the bounding box of the table.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print(f"Table rectangle : x={result_rect.x}, y={result_rect.y}, width={result_rect.width}, height={result_rect.height}")
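If you need the same four corner values you read from PDF text fragments (LLX, LLY, URX, URY), a rectangle given as x, y, width and height can be converted by hand. The following is only a minimal sketch in plain Python, not part of the Aspose.Words API: the `Rect` class, the `union` and `to_pdf_corners` helpers, the sample numbers and the page height are all hypothetical. It assumes the layout rectangle is measured in points from the top-left corner of the page, while the PDF-style corners are measured from the bottom-left corner, and that both lines are on the same page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Rectangle with a top-left origin: x, y, width, height (all in points)."""
    x: float
    y: float
    width: float
    height: float


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle that contains both a and b."""
    left = min(a.x, b.x)
    top = min(a.y, b.y)
    right = max(a.x + a.width, b.x + b.width)
    bottom = max(a.y + a.height, b.y + b.height)
    return Rect(left, top, right - left, bottom - top)


def to_pdf_corners(rect: Rect, page_height: float) -> tuple:
    """Convert a top-left-origin rectangle to (llx, lly, urx, ury) with a bottom-left origin."""
    llx = rect.x
    urx = rect.x + rect.width
    # The y-axis is flipped: the top edge becomes the upper y value.
    ury = page_height - rect.y
    lly = page_height - (rect.y + rect.height)
    return (llx, lly, urx, ury)


# Hypothetical first and last line rectangles of one paragraph on a 792 pt tall page.
first_line = Rect(72.0, 100.0, 400.0, 14.0)
last_line = Rect(72.0, 128.0, 250.0, 14.0)
box = union(first_line, last_line)
print(box)
print(to_pdf_corners(box, page_height=792.0))
```

In real code, the values of `result_rect` from the sample above would take the place of the hypothetical `Rect` instances, and the page height would have to come from the page the paragraph is laid out on.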
But if your goal is to add an annotation or a comment to a particular piece of text in an MS Word document, there is a much easier way. For example, see the following code:
import datetime
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

word = "test"
# Use Range.replace method to make each searched word a separate Run node.
opt = aw.replacing.FindReplaceOptions()
opt.use_substitutions = True
doc.range.replace(word, "$0", opt)

# Get all runs
runs = doc.get_child_nodes(aw.NodeType.RUN, True)
for r in runs:
    run = r.as_run()
    # Process the runs with text that matches the searched word.
    if run.text == word:
        # Create a comment
        comment = aw.Comment(doc, "James Bond", "007", datetime.date.today())
        comment.paragraphs.add(aw.Paragraph(doc))
        comment.first_paragraph.runs.add(aw.Run(doc, "Comment text."))

        # Wrap the Run with CommentRangeStart and CommentRangeEnd
        run.parent_node.insert_before(aw.CommentRangeStart(doc, comment.id), run)
        run.parent_node.insert_after(aw.CommentRangeEnd(doc, comment.id), run)

        # Add the comment.
        run.parent_node.insert_after(comment, run)

doc.save("C:\\Temp\\out.docx")
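To illustrate why the code above first replaces each match with itself: once every occurrence of the word sits in its own segment, a simple equality check is enough to find it. The sketch below shows that idea on a plain string using only the Python standard library. It is an analogy, not how Aspose.Words works internally, and the `split_into_runs` helper and the sample sentence are hypothetical.

```python
import re


def split_into_runs(text: str, word: str) -> list:
    """Split text so that every occurrence of word becomes its own segment."""
    # A capturing group makes re.split keep the matches in the result.
    parts = re.split("(" + re.escape(word) + ")", text)
    # Drop empty strings produced when a match sits at the start or end.
    return [part for part in parts if part]


runs = split_into_runs("A test is a test.", "test")
print(runs)

# Segments that equal the searched word are the ones to annotate.
matches = [index for index, run in enumerate(runs) if run == "test"]
print(matches)
```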
@kathirmsc85,
In MS Excel, data is stored in cells. A cell is the intersection of a row and a column and is the smallest unit of data storage in a worksheet. There is no Text Fragment for cell text in MS Excel either, so to evaluate the x and y coordinates of a cell you need to calculate them yourself from the widths and heights of the columns and rows involved. See the following sample code for your reference:
Sample code:
# Start the JVM before importing the API (required by Aspose.Cells for Python via Java).
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook

# Load the MS Excel file.
workbook = Workbook("sample_aspose.xlsx")
worksheet = workbook.getWorksheets().get(0)
cells = worksheet.getCells()
for cell in cells:
    # Get row of the cell
    row = cell.getRow()
    # Get col of the cell
    col = cell.getColumn()
    # Get the text of the cell
    text = cell.getStringValue()
    # Get the width and height of the current column and row
    column_width = cells.getColumnWidthPixel(col)
    row_height = cells.getRowHeightPixel(row)
    # Calculate the x and y coordinates of the cell center (in pixels, from the top-left corner of the sheet)
    x_coordinate = sum(cells.getColumnWidthPixel(i) for i in range(col)) + (column_width / 2)
    y_coordinate = sum(cells.getRowHeightPixel(i) for i in range(row)) + (row_height / 2)
    # Display the text and coordinates
    print(cell.getName() + ":" + text + " X-Coordinate:" + str(x_coordinate) + " Y-Coordinate:" + str(y_coordinate))
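The sample above gives the center point of each cell. If you need a full rectangle per cell, similar to the rectangle of a PDF text fragment, the same arithmetic can be done once with running totals instead of re-summing the widths for every cell. The following is a minimal sketch in plain Python with hypothetical pixel sizes. The `cell_boxes` helper is not part of Aspose.Cells, and in real code the two lists would be filled from getColumnWidthPixel and getRowHeightPixel.

```python
from itertools import accumulate


def cell_boxes(column_widths, row_heights):
    """Return {(row, col): (left, top, right, bottom)} in pixels, origin at the sheet's top-left."""
    # Prefix sums: lefts[c] is the total width of all columns before column c.
    lefts = [0] + list(accumulate(column_widths))
    tops = [0] + list(accumulate(row_heights))
    boxes = {}
    for row in range(len(row_heights)):
        for col in range(len(column_widths)):
            boxes[(row, col)] = (lefts[col], tops[row], lefts[col + 1], tops[row + 1])
    return boxes


# Hypothetical pixel sizes for a sheet with 2 rows and 3 columns.
boxes = cell_boxes(column_widths=[64, 100, 64], row_heights=[20, 30])
left, top, right, bottom = boxes[(1, 1)]  # zero-based row 1, column 1 is cell B2
center = ((left + right) / 2, (top + bottom) / 2)
print(boxes[(1, 1)], center)
```

The center computed here matches the formula in the sample above: the sizes of all preceding columns or rows plus half of the current one.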
Hope this helps a bit.
@alexey.noskov, @amjad.sahi, Thank you, I will check this and keep you posted.
Thanks,
Kathiresh
@kathirmsc85,
You are welcome. Please take your time to evaluate the suggested code segments. Feel free to write back to us if you have any further queries or comments.