How to get the getTextFragments in Word and Excel

kathirmsc85 · September 6, 2023, 9:12am

Hi Team,

I am using the Aspose.Words and Aspose.Cells Python libraries to read data from Word and Excel files.
I want to get the text fragments (getTextFragments) from these files, which I will later use for annotation.

How can I get the text fragments?

Thanks,
Kathiresh

alexey.noskov · September 6, 2023, 2:47pm

@kathirmsc85 Could you please attach your sample document here for our reference? We will check it and provide you with more information.

amjad.sahi · September 6, 2023, 2:54pm

@kathirmsc85,

Please also share a sample MS Excel file and provide more information on which text fragments you want to get. We will review and assist you accordingly.

kathirmsc85 · September 27, 2023, 11:36am

Hi @amjad.sahi, @alexey.noskov,

Sorry for the late response.
Please find the attached documents.

Thanks,
Kathiresh Muthusamy
aspose_sample.docx (14.0 KB)
sample_aspose.zip (24.6 KB)

alexey.noskov · September 27, 2023, 12:40pm

@kathirmsc85 Thank you for the additional information. As far as I can see, there are no frames in your document, only paragraphs. You can loop through the paragraphs in your document using the following code:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
for node in paragraphs:
    para = node.as_paragraph()
    print(para.to_string(aw.SaveFormat.TEXT).strip())

amjad.sahi · September 27, 2023, 1:45pm

@kathirmsc85,

Thanks for the sample Excel files.

The following sample code shows how to read the contents of the cells in a worksheet of an Excel workbook using Aspose.Cells for Python via Java:

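import jpype
import asposecells

# Start the JVM before importing the Aspose.Cells API classes.
jpype.startJVM()
from asposecells.api import Workbook

# Load the MS Excel file.
workbook = Workbook("sample_aspose.xlsx")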
worksheet = workbook.getWorksheets().get(0)
cells = worksheet.getCells()
for cell in cells:
    print(cell.getName() + ": " + cell.getStringValue())

Hope this helps a bit.

kathirmsc85 · October 5, 2023, 6:49am

Hi @amjad.sahi, @alexey.noskov,

My request is not about reading the data from Word or Excel.

For my POC, I need the x-axis and y-axis coordinates of the text fragments, as shown below. I am able to read them from PDF documents, but how can I get them with Aspose.Words and Aspose.Cells?

sample:

textFragment.getRectangle().getLLX()
textFragment.getRectangle().getLLY()
textFragment.getRectangle().getURX()
textFragment.getRectangle().getURY()

Using these extracted details, I want to annotate words or sentences in a Word or Excel document.

Thanks,
Kathiresh Muthusamy

alexey.noskov · October 5, 2023, 7:24am

@kathirmsc85 As you may know, MS Word documents are flow documents and do not contain any information about document layout, so there is no concept of a text fragment. Consumer applications such as MS Word or OpenOffice build the document layout on the fly. Aspose.Words has its own document layout engine. The facade classes LayoutCollector and LayoutEnumerator allow you to get layout information for document elements. For example, the following code calculates the bounding boxes of paragraphs and tables in the document:

import aspose.words as aw
import aspose.pydrawing as pydraw

# Open document
doc = aw.Document("C:\\Temp\\in.docx")

# Get all paragraphs in the document and wrap them in bookmarks.
# This will allow to get bounds of paragraphs.
paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
para_bookmakrs = []
i = 0
for node in paragraphs:
    p = node.as_paragraph()
    # Skip paragraphs which are in header footer (LayoutCollector and LayoutEnumerator classes do not work with header/footer nodes)
    if p.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None :
        continue

    # Skip paragraphs in tables since tables will be processed separately (due to your requirements)
    if p.get_ancestor(aw.NodeType.TABLE) is not None :
        continue

    bk_name = "tmp_bookmakr_" + str(i)
    para_bookmakrs.append(bk_name)
    i += 1
    # Create a temporary bookmark that wraps paragraph
    bk_start = aw.BookmarkStart(doc, bk_name)
    bk_end = aw.BookmarkEnd(doc, bk_name)
    p.prepend_child(bk_start)
    p.append_child(bk_end)

# Create LayoutCollector and LayoutEnumerator classes to get layout information of nodes.
collector = aw.layout.LayoutCollector(doc)
enumerator = aw.layout.LayoutEnumerator(doc)

# Now we can calculate the bounding box of each bookmarked paragraph.
for bk_name in para_bookmakrs:
    bk = doc.range.bookmarks.get_by_name(bk_name)
    # Move LayoutEnumerator to the line where bookmark start is located
    enumerator.set_current(collector, bk.bookmark_start)
    while enumerator.type != aw.layout.LayoutEntityType.LINE :
        enumerator.move_parent()
    # Get rectangle of the first line in the paragraph.
    first_rect = enumerator.rectangle
    # Do the same with bookmark End
    enumerator.set_current(collector, bk.bookmark_end)
    while enumerator.type != aw.layout.LayoutEntityType.LINE :
        enumerator.move_parent()
    # Get rectangle of the last line in the paragraph.
    last_rect = enumerator.rectangle
    # Union of the rectangles is the bounding box of the paragraph wrapped by bookmark.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print("Paragraph rectangle : x=" + str(result_rect.x) + ", y=" + str(result_rect.y) + ", width=" + str(result_rect.width) +", height=" + str(result_rect.height))

# Do the same with table
tables = doc.get_child_nodes(aw.NodeType.TABLE, True)
for node in tables :
    t = node.as_table()
    # Skip tables which are in header footer (LayoutCollector and LayoutEnumerator classes do not work with header/footer nodes)
    if t.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None :
        continue

    # Move LayoutEnumerator to the first row
    enumerator.set_current(collector, t.first_row.first_cell.first_paragraph)
    while enumerator.type != aw.layout.LayoutEntityType.ROW :
        enumerator.move_parent()
    # Get rectangle of the first row of the table.
    first_rect = enumerator.rectangle
    # Do the same with last row
    enumerator.set_current(collector, t.last_row.first_cell.first_paragraph)
    while enumerator.type != aw.layout.LayoutEntityType.ROW :
        enumerator.move_parent()
    # Get rectangle of the last row in the table.
    last_rect = enumerator.rectangle
    # Union of the rectangles is the bounding box of the table.
    result_rect = pydraw.RectangleF.union(first_rect, last_rect)
    print("Table rectangle : x=" + str(result_rect.x) + ", y=" + str(result_rect.y) + ", width=" + str(result_rect.width) +", height=" + str(result_rect.height))
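
The rectangles printed above are expressed as x, y, width and height. As far as I understand the Aspose.Words layout API, they are measured in points from the top-left corner of the page, whereas the PDF-style getLLX()/getLLY()/getURX()/getURY() values from the question use a bottom-left origin. The following is a minimal, library-free sketch of the union step and of that conversion. The Rect class, the union and to_corners helpers, the sample line rectangles, and the page height of 792 points (US Letter) are all assumptions made for illustration and are not part of Aspose.Words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical rectangle with a top-left origin, in points."""
    x: float
    y: float
    width: float
    height: float


def union(a: Rect, b: Rect) -> Rect:
    """Return the smallest rectangle that contains both a and b."""
    left = min(a.x, b.x)
    top = min(a.y, b.y)
    right = max(a.x + a.width, b.x + b.width)
    bottom = max(a.y + a.height, b.y + b.height)
    return Rect(left, top, right - left, bottom - top)


def to_corners(rect: Rect, page_height: float) -> tuple:
    """Convert a top-left-origin rectangle to (llx, lly, urx, ury)
    measured from the bottom-left corner of the page."""
    llx = rect.x
    urx = rect.x + rect.width
    ury = page_height - rect.y
    lly = page_height - (rect.y + rect.height)
    return (llx, lly, urx, ury)


# Assumed sample values: first and last line of one paragraph.
first_line = Rect(72.0, 100.0, 400.0, 14.0)
last_line = Rect(72.0, 128.0, 150.0, 14.0)

box = union(first_line, last_line)
corners = to_corners(box, page_height=792.0)
print(box)
print(corners)
```

Note that a union of only the first and last line can underestimate the width when a middle line is wider than both, and it does not account for a paragraph that continues onto another page. In real code the page height would come from the page setup of the section rather than a constant.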

But if your goal is to add an annotation or comment to particular text in an MS Word document, there is a much easier way. For example, see the following code:

import datetime
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

word = "test"

# Use Range.replace method to make each searched word a separate Run node.
opt = aw.replacing.FindReplaceOptions()
opt.use_substitutions = True
doc.range.replace(word, "$0", opt)

# Get all runs
runs = doc.get_child_nodes(aw.NodeType.RUN, True)

for r in runs :
    run = r.as_run()
    # process the runs with text that matches the searched word.
    if run.text == word:
        # Create a comment
        comment = aw.Comment(doc, "James Bond", "007", datetime.date.today())
        comment.paragraphs.add(aw.Paragraph(doc))
        comment.first_paragraph.runs.add(aw.Run(doc, "Comment text."))
        # Wrap the Run with CommentRangeStart and CommentRangeEnd
        run.parent_node.insert_before(aw.CommentRangeStart(doc, comment.id), run)
        run.parent_node.insert_after(aw.CommentRangeEnd(doc, comment.id), run)
        # Add a comment.
        run.parent_node.insert_after(comment, run)

doc.save("C:\\Temp\\out.docx")

amjad.sahi · October 5, 2023, 9:51am

@kathirmsc85,

In MS Excel, data is stored in cells. A cell is the intersection of a row and a column; it is the smallest unit of data storage in a worksheet. MS Excel likewise has no concept of a text fragment for cell text, so to find the x and y coordinates of a cell you need to calculate them yourself from the widths and heights of the columns and rows involved. The following sample code computes the center point of each cell, in pixels:

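import jpype
import asposecells

# Start the JVM before importing the Aspose.Cells API classes.
jpype.startJVM()
from asposecells.api import Workbook

# Load the MS Excel file.
workbook = Workbook("sample_aspose.xlsx")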
worksheet = workbook.getWorksheets().get(0)
cells = worksheet.getCells()
for cell in cells:
	# Get row of the cell
	row = cell.getRow()
	# Get col of the cell
	col = cell.getColumn()

	# Get the text of the cell
	text = cell.getStringValue()

	# Get the width and height of the current column and row
	column_width = worksheet.getCells().getColumnWidthPixel(col)
	row_height = worksheet.getCells().getRowHeightPixel(row)

	# Calculate the x- and y-coordinates of the cell's center, in pixels from the top-left corner of the sheet
	x_coordinate = sum(worksheet.getCells().getColumnWidthPixel(i) for i in range(col)) + (column_width / 2)
	y_coordinate = sum(worksheet.getCells().getRowHeightPixel(i) for i in range(row)) + (row_height / 2)

	# Display the text and coordinates
	print(cell.getName() + ":" + cell.getStringValue() + " X-Coordinate:" + str(x_coordinate) + " Y-Coordinate:" + str(y_coordinate))
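
The loop above re-sums every preceding column width and row height for each cell and reports only the center point. If a full rectangle per cell is needed (closer to the LLX/LLY/URX/URY values from the question), one option is to compute running offsets once and look up both edges. The following is a minimal sketch in plain Python. The edge_offsets and cell_rectangle helpers and the sample sizes are hypothetical; in practice the lists would be filled from getColumnWidthPixel and getRowHeightPixel as in the code above.

```python
from itertools import accumulate


def edge_offsets(sizes):
    """Return cumulative offsets: offsets[i] is where item i starts,
    and offsets[i + 1] is where it ends."""
    return [0] + list(accumulate(sizes))


def cell_rectangle(row, col, col_offsets, row_offsets):
    """Return (left, top, right, bottom) of a cell in pixels,
    measured from the top-left corner of the sheet."""
    return (col_offsets[col], row_offsets[row],
            col_offsets[col + 1], row_offsets[row + 1])


# Assumed sample sizes in pixels for three columns and three rows.
column_widths = [64, 120, 64]
row_heights = [20, 20, 35]

col_offsets = edge_offsets(column_widths)
row_offsets = edge_offsets(row_heights)

# Zero-based row 2, column 1 corresponds to cell B3.
rect_b3 = cell_rectangle(2, 1, col_offsets, row_offsets)
center_b3 = ((rect_b3[0] + rect_b3[2]) / 2, (rect_b3[1] + rect_b3[3]) / 2)
print(rect_b3, center_b3)
```

The center computed from the rectangle matches the formula used in the loop above (sum of preceding sizes plus half of the current size), so the two approaches give the same point.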

Hope this helps a bit.

kathirmsc85 · October 5, 2023, 10:12am

@alexey.noskov, @amjad.sahi, Thank you. I will check this and keep you posted.

Thanks,
Kathiresh

amjad.sahi · October 5, 2023, 10:14am

@kathirmsc85,

You are welcome. Please take your time to evaluate the suggested code segments. Feel free to write back to us if you have any further queries or comments.