How can I extract Table of Contents headers and their text styles with level numbers in Python using Aspose.Words for Python via .NET?
To extract the Table of Contents (TOC) headers and their corresponding text styles along with level numbers using Aspose.Words for Python via .NET, you can follow the steps outlined below. This process involves accessing the TOC field in the document and retrieving the necessary information.
Here’s a sample code snippet that demonstrates how to achieve this:
import aspose.words as aw

# Load the document
doc = aw.Document("your_document.docx")

# Iterate through the fields in the document
for field in doc.range.fields:
    if field.type == aw.fields.FieldType.FIELD_TOC:
        # Update the TOC to ensure it reflects the current document structure
        field.update()
        # Access the TOC entries
        for entry in field.result:
            # Extract the text of the TOC entry
            toc_text = entry.text
            # Extract the level of the TOC entry
            toc_level = entry.level
            # Extract the style of the TOC entry
            toc_style = entry.style
            print(f"TOC Entry: {toc_text}, Level: {toc_level}, Style: {toc_style.name}")
Explanation:
- Load the Document: Use aw.Document to load your Word document.
- Iterate Through Fields: Loop through the fields in the document to find the TOC field.
- Update the TOC: Call field.update() to refresh the TOC entries based on the current document content.
- Access TOC Entries: For each entry in the TOC, you can retrieve the text, level, and style.
- Print the Information: Finally, print or store the extracted information as needed.
This code will help you extract the headers from the TOC along with their styles and levels, allowing you to manipulate or display them as required.
For more detailed information on working with the Table of Contents in Aspose.Words, you can refer to the official documentation [1].
Sources:
[1]: Working with Table of Contents|Aspose.Words for Python via .NET
This code does not work. It gives the following error for the line toc_text = entry.text:
'str' object has no attribute 'text'
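The error message suggests that field.result is a plain Python string, so the inner loop walks over single characters rather than TOC entries. A minimal sketch that reproduces the same message without Aspose.Words, using an ordinary string as a stand-in for field.result (the helper name first_entry_error is hypothetical):

```python
def first_entry_error(field_result):
    """Mimic the failing loop on a plain string and return the error message."""
    for entry in field_result:
        try:
            # Each `entry` is a one-character str, which has no `.text` attribute
            return entry.text
        except AttributeError as exc:
            return str(exc)
    return None


# Stand-in value; the real field.result would hold the TOC field's result text
message = first_entry_error("1 This is Heading 1")
print(message)
```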
@ln22 The code provided by AI is not quite correct. Could you please attach your input document and provide the expected output? We will check and provide you more information.
Testing Doc for Aspose Pulling Table of Contents.docx (18.1 KB)
Expected Output:
python_dict = {'1 This is Heading 1': 'TOC1', '1.1 This is Heading 1.1': 'TOC2', '1.1.1 This is heading 1.1.1': 'TOC3', '2 This is Heading 2': 'TOC1', '2.1 This is Heading 2.1': 'TOC2', '2.1.1 This is Heading 2.1.1': 'TOC3'}
It would also be fine to just have the heading level number rather than the string "TOC#".
@ln22 Thank you for the additional information. You can use code like the following to achieve this:
import aspose.words as aw

# Load the Word document
doc = aw.Document("C:\\Temp\\in.docx")

# Get TOC paragraphs; they are marked with the appropriate styles.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    p = p.as_paragraph()
    if p.paragraph_format.style_identifier == aw.StyleIdentifier.TOC1:
        print(p.to_string(aw.SaveFormat.TEXT).strip())
        print("TOC1")
    if p.paragraph_format.style_identifier == aw.StyleIdentifier.TOC2:
        print(p.to_string(aw.SaveFormat.TEXT).strip())
        print("TOC2")
    # and so on for the TOC3...TOC9 levels
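The repeated if-blocks above could be collapsed into a loop over TOC1 to TOC9 and collected into a dictionary shaped like the expected output. This is only a sketch: toc_label and toc_to_dict are hypothetical helper names, and the Aspose.Words import is deferred into the function so the matching logic can be exercised on its own. Depending on how the TOC is laid out, the extracted text may also carry a tab and page number, which is not handled here.

```python
TOC_NAMES = [f"TOC{n}" for n in range(1, 10)]


def toc_label(style_identifier, style_enum):
    """Return 'TOC1'..'TOC9' if style_identifier matches that member, else None."""
    for name in TOC_NAMES:
        if style_identifier == getattr(style_enum, name):
            return name
    return None


def toc_to_dict(path):
    """Map the text of each TOC paragraph to its TOC style label."""
    import aspose.words as aw  # deferred import; requires Aspose.Words installed

    doc = aw.Document(path)
    result = {}
    for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        p = node.as_paragraph()
        label = toc_label(p.paragraph_format.style_identifier, aw.StyleIdentifier)
        if label is not None:
            result[p.to_string(aw.SaveFormat.TEXT).strip()] = label
    return result
```

Calling toc_to_dict("C:\\Temp\\in.docx") would then return the dictionary instead of printing each entry.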
Hello,
I will not know the style ahead of time. I want to get the style of each section in the table of contents and the section header text.
@ln22 In an MS Word document, TOC items are always formatted with the TOC1…TOC9 styles, so the code provided above will get the TOC items.
Thanks for your help thus far. Is there a way to automatically pull out this level # for TOC and for HEADING text? For example if a paragraph was of style TOC1 or HEADING1, Aspose could provide the integer 1?
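One possible way to get an integer is to parse it out of the paragraph's style name. The sketch below assumes the built-in styles are named like "TOC 1" and "Heading 1", which should be checked against the actual document; level_from_style_name and print_levels are hypothetical helper names, and the Aspose.Words import is deferred into the function that needs it.

```python
import re

# Matches names such as "TOC 1", "TOC1", "Heading 2" (assumed naming scheme)
_LEVEL_PATTERN = re.compile(r"^(TOC|Heading)\s*([1-9])$", re.IGNORECASE)


def level_from_style_name(style_name):
    """Return (kind, level) for a TOC/Heading style name, or None otherwise."""
    match = _LEVEL_PATTERN.match(style_name.strip())
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))


def print_levels(path):
    """Print kind, integer level, and text for each TOC or heading paragraph."""
    import aspose.words as aw  # deferred import; requires Aspose.Words installed

    doc = aw.Document(path)
    for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        p = node.as_paragraph()
        parsed = level_from_style_name(p.paragraph_format.style_name)
        if parsed is not None:
            kind, level = parsed
            print(kind, level, p.to_string(aw.SaveFormat.TEXT).strip())
```

With this approach a paragraph styled TOC1 or HEADING1 would both yield the integer 1, with the kind telling the two apart.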