Iterate through all objects in a Word document that have text, in order

ln22 · January 16, 2025, 7:27pm

Hello,

How can I iterate through an entire Word document, in order, and pull out the text? My goal is to extract all text from the document in the order it appears.

I would like the code to be written in Python with Aspose.Words for Python via .NET.

Professionalize.Discourse · January 16, 2025, 7:28pm

@ln22

To iterate through all objects in a Word document and extract text in order using Aspose.Words for Python via .NET, you can utilize the Document Object Model (DOM) provided by the library. Below is a code example that demonstrates how to achieve this:

import aspose.words as aw

# Load the Word document
doc = aw.Document("path/to/your/document.docx")

# Create a list to hold the extracted text
extracted_text = []

# Iterate through all paragraphs in the document
for paragraph in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    # Iterate through all runs in the paragraph
    for run in paragraph.get_child_nodes(aw.NodeType.RUN, True):
        extracted_text.append(run.text)

# Join the extracted text into a single string
full_text = "\n".join(extracted_text)

# Print the extracted text
print(full_text)

Explanation:

Loading the Document: The document is loaded using the Document class.
Iterating Through Nodes: The code iterates through all paragraph nodes in the document. For each paragraph, it further iterates through the run nodes, which contain the actual text.
Extracting Text: The text from each run is collected into a list.
Output: Finally, the extracted text is joined into a single string and printed.

This approach ensures that you extract all text from the document in the order it appears. You can modify the path to the document as needed.

For more detailed information on working with the Aspose.Words DOM, you can refer to the official documentation [1], which provides insights into navigating and manipulating document elements.

Sources:

[1]: Aspose.Words DOM|Aspose.Words for Python via .NET - Aspose Documentation

ln22 · January 16, 2025, 7:32pm

The following line gives an error:

for run in paragraph.get_child_nodes(aw.NodeType.RUN, True):

'aspose.words.Node' object has no attribute 'get_child_nodes'

This option might also miss text. I would also like to understand where the text is located: for example, in headers, footers, footnotes, plain paragraphs, shapes (like a text box), or tables.

I would also like to know how to pull out styling information for the text, for example style, location on the page, bold, italic, underlined, etc.

alexey.noskov · January 17, 2025, 6:14am

@ln22 The code provided by AI is not quite correct. Please try using the following code:

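import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

# get_child_nodes returns generic Node objects, so cast each one
# before using members that belong to the specific node type.
for paragraph in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    paragraph = paragraph.as_paragraph()
    for run in paragraph.get_child_nodes(aw.NodeType.RUN, True):
        run = run.as_run()
        print(run.text)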

But if your goal is to extract text from the document, you can use the following code:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
# Convert the whole document to plain text, keeping table layout
txt_save_options = aw.saving.TxtSaveOptions()
txt_save_options.preserve_table_layout = True
text = doc.to_string(txt_save_options).strip()
print(text)
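
On the questions about where the text is located and how it is formatted, the following is only a rough sketch and not an official sample. It assumes that get_ancestor, font.bold, font.italic, font.underline and parent_paragraph.paragraph_format.style_name behave as in the .NET API. The names describe_run and describe_document are made up for this illustration, and the aw module is passed in as a parameter so the helper can be tried with stand-in objects. Position on the page is not covered here.

```python
def describe_run(run, aw):
    """Summarise where a run sits and how it is formatted.

    `aw` is the aspose.words module (assumption: passed in, not imported here).
    """
    containers = [
        ("footnote", aw.NodeType.FOOTNOTE),
        ("shape", aw.NodeType.SHAPE),
        ("table", aw.NodeType.TABLE),
        ("header/footer", aw.NodeType.HEADER_FOOTER),
    ]
    # A run can be nested in several containers, e.g. a table inside a header.
    location = [name for name, node_type in containers
                if run.get_ancestor(node_type) is not None]
    font = run.font
    return {
        "text": run.text,
        "location": location or ["body"],
        "style": run.parent_paragraph.paragraph_format.style_name,
        "bold": font.bold,
        "italic": font.italic,
        "underlined": font.underline != aw.Underline.NONE,
    }


def describe_document(path):
    """Hypothetical driver: describe every run in the document at `path`."""
    import aspose.words as aw  # lazy import; assumes the package is installed

    doc = aw.Document(path)
    runs = doc.get_child_nodes(aw.NodeType.RUN, True)
    return [describe_run(node.as_run(), aw) for node in runs]
```

Calling describe_document("C:\\Temp\\in.docx") should give one dictionary per run, in document order. A run with no matching ancestor is reported as "body".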

ln22 · January 17, 2025, 8:55pm

Hello,

Why do some run.text values contain things like ' DOCVARIABLE "opt_hrt_29424" \* MERGEFORMAT ' instead of text that is actually in the document? How can I filter out runs like this? Do they have a specific attribute that will tell me the run is not actual text in the document?

Is there a way to recursively walk through the document and stop the recursion at specific levels? For example, if I want to stop at the table level so that I can extract each table separately in my code, rather than getting the table object and then seeing all the objects below it, how would I go about this?

alexey.noskov · January 18, 2025, 6:33am

@ln22

This is a field code, which is also represented as a RUN node in the MS Word document object model. Please see our documentation for more information:
https://docs.aspose.com/words/python-net/fields-overview/
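
One possible way to skip such runs is sketched below, as an illustration rather than a tested recipe. It assumes a field is laid out as FIELD_START, field code runs, FIELD_SEPARATOR, field result runs, FIELD_END, and drops every run that falls between a start and its separator. The helper names visible_run_texts and extract_visible_text are hypothetical, and NodeType is passed in so the filter itself can be tried without the library.

```python
def visible_run_texts(nodes, node_type):
    """Yield the text of runs that are not part of a field code.

    nodes: all document nodes in document order.
    node_type: aw.NodeType, or any object exposing the same member names.
    """
    in_code = []  # one flag per open field: True while inside its code part
    for node in nodes:
        kind = node.node_type
        if kind == node_type.FIELD_START:
            in_code.append(True)
        elif kind == node_type.FIELD_SEPARATOR:
            if in_code:
                in_code[-1] = False  # the field result starts here
        elif kind == node_type.FIELD_END:
            if in_code:
                in_code.pop()
        elif kind == node_type.RUN and not any(in_code):
            yield node.get_text()


def extract_visible_text(path):
    """Hypothetical driver: list the non-field-code run texts of a document."""
    import aspose.words as aw  # lazy import; assumes the package is installed

    doc = aw.Document(path)
    nodes = doc.get_child_nodes(aw.NodeType.ANY, True)
    return list(visible_run_texts(nodes, aw.NodeType))
```

The stack of flags is there for nested fields: a run is kept only when none of the enclosing fields is still in its code part, so field results are kept while the DOCVARIABLE-style codes are dropped.
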

To process tables separately, you can use the following code, which looks only at the direct children of each section body:

import aspose.words as aw

# Load the Word document
doc = aw.Document("C:\\Temp\\in.docx")
for s in doc.sections:
    s = s.as_section()
    # False = direct children of the body only, so table contents are not visited
    for n in s.body.get_child_nodes(aw.NodeType.ANY, False):
        if n.node_type == aw.NodeType.TABLE:
            # process the TABLE
            print("This is a table.")
        else:
            # process the other nodes.
            print("This is not a table.")
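
The loop above only looks at the direct children of the body. If the walk should also reach headers, footers and shapes while still treating each table as a single unit, a variant along these lines might work. It is a sketch that assumes get_ancestor returns None when there is no ancestor of the given type and does not match the node itself. The names top_level_units and walk_document are hypothetical.

```python
def top_level_units(nodes, stop_type):
    """Yield nodes in document order, treating each `stop_type` node as a leaf."""
    for node in nodes:
        # A node with a stop_type ancestor lives inside a unit already yielded.
        if node.get_ancestor(stop_type) is None:
            yield node


def walk_document(path):
    """Hypothetical driver: print runs outside tables and one line per table."""
    import aspose.words as aw  # lazy import; assumes the package is installed

    doc = aw.Document(path)
    nodes = doc.get_child_nodes(aw.NodeType.ANY, True)
    for node in top_level_units(nodes, aw.NodeType.TABLE):
        if node.node_type == aw.NodeType.TABLE:
            table = node.as_table()
            print("table with", table.rows.count, "rows")
        elif node.node_type == aw.NodeType.RUN:
            print("run:", node.get_text())
```

Nested tables are skipped together with the rest of the outer table's content, so each outermost table is handed to your own extraction code exactly once.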