How can I extract metadata about the level of automatically numbered, multi-level list text in a Word DOCX with the Aspose.Words Python library? I want output that maps labels like 1.a.i to level information like 1.1.1, so that I know numerically which level the current text is on.
Extracting List‑Level Information from a DOCX with Aspose.Words for Python via .NET
Aspose.Words can read the hierarchical numbering of a multi‑level list (e.g., 1.a.i) and also tell you the numeric level (0‑based).
Below is a complete, ready‑to‑run example that:
Loads the Word document.
Walks through every paragraph.
Detects whether the paragraph belongs to a list.
Retrieves the displayed label (1., a., i. …) and the list level (0, 1, 2 …).
Prints the label together with the paragraph text.
# --------------------------------------------------------------
# Aspose.Words for Python via .NET – List level extraction
# --------------------------------------------------------------
import aspose.words as aw

# ------------------------------------------------------------------
# 1. Load the DOCX (replace the path with your own file location)
# ------------------------------------------------------------------
doc = aw.Document("Aspose list levels.docx")

# Refresh the list labels so that list_label holds the displayed values
doc.update_list_labels()

# ------------------------------------------------------------------
# 2. Iterate over all paragraphs in the document
# ------------------------------------------------------------------
for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    paragraph = node.as_paragraph()

    # --------------------------------------------------------------
    # 3. Check whether the paragraph is part of a list
    # --------------------------------------------------------------
    if not paragraph.is_list_item:
        continue  # skip non-list paragraphs

    # --------------------------------------------------------------
    # 4. Get the label that Word shows (e.g. "1.", "a.", "i.")
    # --------------------------------------------------------------
    label = paragraph.list_label            # aw.lists.ListLabel
    displayed_label = label.label_string    # text as displayed in the document

    # --------------------------------------------------------------
    # 5. Get the numeric level (0 = top level, 1 = second level, …)
    # --------------------------------------------------------------
    level_index = paragraph.list_format.list_level_number  # 0-based index
    level_number = level_index + 1                          # 1-based for readability

    # --------------------------------------------------------------
    # 6. Retrieve the plain paragraph text (without the label)
    # --------------------------------------------------------------
    para_text = paragraph.to_string(aw.SaveFormat.TEXT).strip()

    # --------------------------------------------------------------
    # 7. Output the information
    # --------------------------------------------------------------
    print(f"Level {level_number} | Label: {displayed_label:<4} | Text: {para_text}")
# --------------------------------------------------------------
# End of script
# --------------------------------------------------------------
What the script prints
For the sample Aspose list levels.docx the output looks similar to:
Level 1 | Label: 1. | Text: Section 1
Level 2 | Label: a. | Text: Sub‑section a
Level 3 | Label: i. | Text: Sub‑sub‑section i
Level 1 | Label: 2. | Text: Section 2
Level 2 | Label: b. | Text: Sub‑section b
...
Level – the hierarchical depth (1 = top level, 2 = second level, …).
Label – exactly what Word displays for that paragraph (1., a., i. etc.).
Text – the actual paragraph content without the numbering.
How It Works Under the Hood
Aspose.Words API | Meaning
Paragraph.IsListItem | Tells whether the paragraph participates in a list.
Paragraph.ListLabel | Gives the label that Word renders (LabelString, e.g. 1., a., i.) and its numeric value (LabelValue). Call Document.UpdateListLabels() first.
Paragraph.ListFormat.ListLevelNumber | Zero-based index of the list level (0 = first level).
Paragraph.ListFormat.List and Paragraph.ListFormat.ListLevel | Give access to the List and ListLevel objects if you need further details (e.g., number style).
If you need the raw numeric value of an item (e.g., 1 instead of a), read paragraph.list_label.label_value. To find out how that value is rendered, inspect the number_style of the paragraph's ListLevel:
list_level = paragraph.list_format.list_level
if list_level.number_style == aw.NumberStyle.ARABIC:
    rendering = "integer (1, 2, 3 …)"
elif list_level.number_style == aw.NumberStyle.LOWERCASE_LETTER:
    rendering = "letter (a, b, c …)"
elif list_level.number_style == aw.NumberStyle.LOWERCASE_ROMAN:
    rendering = "roman numeral (i, ii, iii …)"
else:
    rendering = "other"  # remaining NumberStyle values
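To make the relation between a numeric label value and its displayed form concrete, here is a small library-independent sketch. The function names and the style strings ("arabic", "lower_letter", "lower_roman") are hypothetical and are not part of Aspose.Words; the letter conversion deliberately covers only 1 to 26.

```python
# Hypothetical helpers: convert a numeric list value into the text that a
# given number style would typically display. Not part of Aspose.Words.

_ROMAN = [
    (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
    (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
    (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
]


def to_lower_letter(value: int) -> str:
    """Map 1 -> 'a' ... 26 -> 'z'. Larger values are out of scope here."""
    if not 1 <= value <= 26:
        raise ValueError("this sketch only covers values 1..26")
    return chr(ord("a") + value - 1)


def to_lower_roman(value: int) -> str:
    """Map a positive integer to a lowercase roman numeral."""
    if value < 1:
        raise ValueError("roman numerals start at 1")
    parts = []
    for amount, symbol in _ROMAN:
        count, value = divmod(value, amount)
        parts.append(symbol * count)
    return "".join(parts)


def render(value: int, style: str) -> str:
    """Render a value in one of the three styles used in this thread."""
    if style == "arabic":
        return str(value)
    if style == "lower_letter":
        return to_lower_letter(value)
    if style == "lower_roman":
        return to_lower_roman(value)
    raise ValueError(f"unsupported style: {style}")


# The item labelled 1.a.i has the numeric values 1, 1, 1 on its three levels.
example = ".".join(
    render(1, style) for style in ("arabic", "lower_letter", "lower_roman")
)
```

Going the other way (from a to 1) is rarely needed, because the label value is already available as a number.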
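Full hierarchical number (e.g., 1.2.3) – after calling doc.update_list_labels(), keep a running list of the label values seen at each level while iterating, and join it for every list item (this assumes levels are not skipped):
path = []  # create once, before the paragraph loop
# ... then, inside the loop, for every list paragraph:
level = paragraph.list_format.list_level_number
path[level:] = [paragraph.list_label.label_value]  # drop deeper levels, set this one
full_label = ".".join(str(value) for value in path)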
Export to CSV / JSON – simply collect the data in a list of dictionaries and then write it with csv or json modules.
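As a rough illustration of the export idea, the sketch below serializes a list of dictionaries with the standard csv and json modules. The field names (level, label, value, text) and the sample records are assumptions made for this example; in practice the records would be appended inside the paragraph loop.

```python
import csv
import io
import json

FIELDS = ["level", "label", "value", "text"]  # hypothetical field names


def to_json(records):
    """Serialize the records as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative records only; real ones come from the document.
sample_records = [
    {"level": 1, "label": "1.", "value": 1, "text": "Section 1"},
    {"level": 2, "label": "a.", "value": 1, "text": "Sub-section a"},
    {"level": 3, "label": "i.", "value": 1, "text": "Sub-sub-section i"},
]

json_text = to_json(sample_records)
csv_text = to_csv(sample_records)
```

Writing either string to disk is then a matter of opening a file in text mode with encoding="utf-8" and calling write().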
Gotchas & Tips
Situation | What to Do
Paragraphs inside a table cell or header/footer. | doc.get_child_nodes(aw.NodeType.PARAGRAPH, True) already returns them. To exclude them, check paragraph.is_in_cell or test whether paragraph.get_ancestor(aw.NodeType.HEADER_FOOTER) is not None.
Different list definitions in the same document, i.e. multiple List objects. | Each paragraph's list_format.list refers to its own List object, so every list is numbered independently.
List continuation after a page break: numbering continues automatically and the same List is reused, so you still obtain the correct label. | No extra handling required.
If you have any further requirements—e.g., exporting the hierarchy to JSON, handling nested tables, or customizing the number format—just let us know, and we’ll provide a tailored snippet!
@ln22 You can get the required information using ListFormat, ListLevel and ListLabel properties. For example see the following code:
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
# To get the actual (displayed) list item labels, call update_list_labels first.
doc.update_list_labels()
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    p = p.as_paragraph()
    if p.is_list_item:
        print(f"Label string: {p.list_label.label_string}")  # Actual (displayed) list label string.
        print(f"Label value: {p.list_label.label_value}")  # List label numeric value.
        print(f"Level: {p.list_format.list_level_number}")  # Level of the list item.
        print("------------------------------")
@alexey.noskov
Can you please help me extend the code you shared above by showing how to find the label value and level of paragraphs that are not list items themselves but sit on the same level as a specific list item? For example, in the document I attached, I would want to know that the text “Text within the level 1.1 a) level but not specifically a list value” has the same label value and level value as a), which is label value: 1 and level value: 1.
@ln22 The mentioned paragraphs do not belong to the list items, so there is no direct way to determine that they are “on the same level”. To achieve this, it is required to implement custom logic.
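One possible shape for such custom logic is sketched below. It is only a heuristic, and it rests on an assumption: a non-list paragraph is treated as belonging to the nearest preceding list item. The Row and Resolved types and the inherit_list_context function are hypothetical names introduced for this example; the sketch works on plain tuples so it can be tried without a document.

```python
from typing import Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    """One paragraph, reduced to the values needed by the heuristic."""
    text: str
    is_list_item: bool
    level: Optional[int] = None        # list_level_number for list items
    label_value: Optional[int] = None  # list_label.label_value for list items


class Resolved(NamedTuple):
    text: str
    level: Optional[int]
    label_value: Optional[int]
    inherited: bool  # True when the values were copied from a previous item


def inherit_list_context(rows: Iterable[Row]) -> List[Resolved]:
    """Give each non-list paragraph the level and value of the last list item."""
    resolved = []
    current_level = None
    current_value = None
    for row in rows:
        if row.is_list_item:
            # A real list item defines the context for what follows it.
            current_level, current_value = row.level, row.label_value
            resolved.append(Resolved(row.text, row.level, row.label_value, False))
        else:
            # Before the first list item there is nothing to inherit.
            has_context = current_level is not None
            resolved.append(
                Resolved(row.text, current_level, current_value, has_context)
            )
    return resolved


rows = [
    Row("Introduction", False),
    Row("First item", True, level=1, label_value=1),
    Row("Text below the first item, not a list item itself", False),
    Row("Second item", True, level=1, label_value=2),
]
result = inherit_list_context(rows)
```

With Aspose.Words, each Row could be filled from p.is_list_item, p.list_format.list_level_number and p.list_label.label_value as in the snippet above. A stricter variant might also compare paragraph indentation before inheriting, since a paragraph that follows a list item is not necessarily meant to belong to it.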