Extract Table of Contents

ln22 · January 16, 2025, 7:44pm

How can I extract the table of contents headers and their text styles, with level numbers, in Python with Aspose.Words for Python via .NET?

Professionalize.Discourse · January 16, 2025, 7:44pm

@ln22

To extract the Table of Contents (TOC) headers and their corresponding text styles along with level numbers using Aspose.Words for Python via .NET, you can follow the steps outlined below. This process involves accessing the TOC field in the document and retrieving the necessary information.

Here’s a sample code snippet that demonstrates how to achieve this:

import aspose.words as aw

# Load the document
doc = aw.Document("your_document.docx")

# Iterate through the fields in the document
for field in doc.range.fields:
    if field.type == aw.fields.FieldType.FIELD_TOC:
        # Update the TOC to ensure it reflects the current document structure
        field.update()

        # Access the TOC entries
        for entry in field.result:
            # Extract the text of the TOC entry
            toc_text = entry.text
            # Extract the level of the TOC entry
            toc_level = entry.level
            # Extract the style of the TOC entry
            toc_style = entry.style

            print(f"TOC Entry: {toc_text}, Level: {toc_level}, Style: {toc_style.name}")

Explanation:

Load the Document: Use aw.Document to load your Word document.
Iterate Through Fields: Loop through the fields in the document to find the TOC field.
Update the TOC: Call field.update() to refresh the TOC entries based on the current document content.
Access TOC Entries: For each entry in the TOC, you can retrieve the text, level, and style.
Print the Information: Finally, print or store the extracted information as needed.

This code will help you extract the headers from the TOC along with their styles and levels, allowing you to manipulate or display them as required.

For more detailed information on working with the Table of Contents in Aspose.Words, you can refer to the official documentation here.

Sources:
[1]: Working with Table of Contents|Aspose.Words for Python via .NET

ln22 · January 16, 2025, 8:00pm

This code does not work. It gives the following error at the line toc_text = entry.text:

'str' object has no attribute 'text'

alexey.noskov · January 17, 2025, 6:04am

@ln22 The code provided by AI is not quite correct. Could you please attach your input document and provide the expected output? We will check and provide you with more information.

ln22 · January 17, 2025, 5:00pm

Testing Doc for Aspose Pulling Table of Contents.docx (18.1 KB)

Expected Output:

python_dict = {'1 This is Heading 1': 'TOC1', '1.1 This is Heading 1.1': 'TOC2', '1.1.1 This is heading 1.1.1': 'TOC3', '2 This is Heading 2': 'TOC1', '2.1 This is Heading 2.1': 'TOC2', '2.1.1 This is Heading 2.1.1': 'TOC3'}

It would also be fine to have just the heading level number rather than the string "TOC#".

alexey.noskov · January 17, 2025, 5:31pm

@ln22 Thank you for the additional information. You can use code like the following to achieve this:

import aspose.words as aw

# Load the Word document
doc = aw.Document("C:\\Temp\\in.docx")
# Get the TOC paragraphs; they are marked with the built-in TOC styles.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    p = p.as_paragraph()
    if p.paragraph_format.style_identifier == aw.StyleIdentifier.TOC1:
        print(p.to_string(aw.SaveFormat.TEXT).strip())
        print("TOC1")
    if p.paragraph_format.style_identifier == aw.StyleIdentifier.TOC2:
        print(p.to_string(aw.SaveFormat.TEXT).strip())
        print("TOC2")
# ...and so on for the remaining levels, TOC3 through TOC9
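
The repeated branches could probably be collapsed into a lookup on the style name. The following is only a sketch under a few assumptions: the built-in TOC styles are assumed to report the English names "TOC 1" through "TOC 9", the helper names (toc_label, build_toc_dict, read_toc) are made up for illustration, and read_toc reads the style name through paragraph_format.style_name. The extracted text is used as-is, so any tab or page number at the end of a TOC line stays in the key.

```python
import re

# Assumption: built-in TOC styles use the English names "TOC 1" ... "TOC 9".
_TOC_NAME = re.compile(r"^TOC ([1-9])$")


def toc_label(style_name):
    """Return 'TOC1'...'TOC9' for a TOC style name, or None for anything else."""
    match = _TOC_NAME.match(style_name.strip())
    return f"TOC{match.group(1)}" if match else None


def build_toc_dict(pairs):
    """Map entry text to its TOC label, given (text, style_name) pairs."""
    result = {}
    for text, style_name in pairs:
        label = toc_label(style_name)
        if label is not None:
            result[text.strip()] = label
    return result


def read_toc(path):
    """Collect TOC entries from a document (hypothetical helper, needs aspose-words)."""
    import aspose.words as aw  # imported lazily so the helpers above work without it

    doc = aw.Document(path)
    pairs = []
    for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        para = node.as_paragraph()
        text = para.to_string(aw.SaveFormat.TEXT)
        pairs.append((text, para.paragraph_format.style_name))
    return build_toc_dict(pairs)
```

Calling read_toc("C:\\Temp\\in.docx") should then return a dictionary shaped like the expected output posted earlier in the thread, with one key per TOC line.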

ln22 · January 17, 2025, 5:52pm

Hello,

I will not know the style ahead of time. I want to get the style of each section in the table of contents and the section header text.

alexey.noskov · January 18, 2025, 6:14am

@ln22 In an MS Word document, TOC items are always formatted with the TOC1…TOC9 styles, so the code provided above will get the TOC items.

ln22 · January 21, 2025, 4:31pm

Thanks for your help thus far. Is there a way to automatically pull out this level # for TOC and for HEADING text? For example if a paragraph was of style TOC1 or HEADING1, Aspose could provide the integer 1?

alexey.noskov · January 21, 2025, 5:57pm

@ln22 You can use the ParagraphFormat.outline_level property.
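
If a plain integer is wanted for both TOC and heading paragraphs, one alternative is to read the number from the style name. This is a rough sketch, not the library's own mechanism: style_level is a hypothetical helper, and it assumes the built-in styles carry the English names "TOC n" and "Heading n". It may be worth checking what outline_level reports for TOC-styled paragraphs in your document, since they may not carry the same outline level as the headings they point to.

```python
import re

# Assumption: built-in styles are named "TOC n" or "Heading n", with n from 1 to 9.
_LEVEL_NAME = re.compile(r"^(?:TOC|Heading) ([1-9])$", re.IGNORECASE)


def style_level(style_name):
    """Return the integer level 1-9 for a TOC or Heading style name, else None."""
    match = _LEVEL_NAME.match(style_name.strip())
    return int(match.group(1)) if match else None
```

With a paragraph p obtained as in the earlier snippet, style_level(p.paragraph_format.style_name) would give 1 for both a TOC 1 and a Heading 1 paragraph, and None for paragraphs in any other style.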