Extract Subsection List Text Meta Data

How can I extract metadata about the level of automatically numbered (multi-level) list text in a Word DOCX using the Aspose.Words Python library? I would like output that maps a label such as 1.a.i to level information such as 1.1.1, so that I know numerically which level the current text is on.

Aspose list levels.docx (15.1 KB)

@ln22

Extracting List‑Level Information from a DOCX with Aspose.Words for Python via .NET

Aspose.Words can read the hierarchical numbering of a multi‑level list (e.g., 1.a.i) and also tell you the numeric level (0‑based).
Below is a complete, ready‑to‑run example that:

  1. Loads the Word document.
  2. Walks through every paragraph.
  3. Detects whether the paragraph belongs to a list.
  4. Retrieves the displayed label (1., a., i. …) and the list level (0, 1, 2 …).
  5. Prints the label together with the paragraph text.
# --------------------------------------------------------------
#   Aspose.Words for Python via .NET – List level extraction
# --------------------------------------------------------------

import aspose.words as aw

# ------------------------------------------------------------------
# 1. Load the DOCX (replace the path with your own file location)
# ------------------------------------------------------------------
doc = aw.Document("Aspose list levels.docx")

# ------------------------------------------------------------------
# 2. Iterate over all paragraphs in the document
# ------------------------------------------------------------------
for para in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    paragraph = para.as_paragraph()

    # ------------------------------------------------------------------
    # 3. Check whether the paragraph is part of a list
    # ------------------------------------------------------------------
    list_fmt = paragraph.list_format
    if not list_fmt.is_list:
        continue                      # skip non‑list paragraphs

    # ------------------------------------------------------------------
    # 4. Get the label that Word shows (e.g. "1.", "a.", "i.")
    # ------------------------------------------------------------------
    label = list_fmt.list_label       # type: aw.ListLabel
    displayed_label = label.text      # already includes the trailing dot

    # ------------------------------------------------------------------
    # 5. Get the numeric level (0 = top level, 1 = second level, …)
    # ------------------------------------------------------------------
    level_index = list_fmt.list_level_number   # 0‑based index
    level_number = level_index + 1              # 1‑based for readability

    # ------------------------------------------------------------------
    # 6. Retrieve the plain paragraph text (without the label)
    # ------------------------------------------------------------------
    para_text = paragraph.to_string(aw.SaveFormat.TEXT).strip()

    # ------------------------------------------------------------------
    # 7. Output the information
    # ------------------------------------------------------------------
    print(f"Level {level_number:>2} | Label: {displayed_label:<4} | Text: {para_text}")

# --------------------------------------------------------------
# End of script
# --------------------------------------------------------------

What the script prints

For the sample Aspose list levels.docx the output looks similar to:

Level  1 | Label: 1.   | Text: Section 1
Level  2 | Label: a.   | Text: Sub‑section a
Level  3 | Label: i.   | Text: Sub‑sub‑section i
Level  1 | Label: 2.   | Text: Section 2
Level  2 | Label: b.   | Text: Sub‑section b
...
  • Level – the hierarchical depth (1 = top level, 2 = second level, …).
  • Label – exactly what Word displays for that paragraph (1., a., i. etc.).
  • Text – the actual paragraph content without the numbering.

How It Works Under the Hood

Aspose.Words API                     | Meaning
Paragraph.ListFormat.IsList          | Tells whether the paragraph participates in a list.
Paragraph.ListFormat.ListLabel       | Gives the formatted label (1., a., i.) that Word renders.
Paragraph.ListFormat.ListLevelNumber | Zero‑based index of the list level (0 = first level).
Document.Lists[ listId ]             | Gives access to the List object if you need further details (e.g., number style).

If you ever need the raw numeric value of a level (e.g., 1 instead of a), you can inspect the ListLevel.NumberStyle:

list_level = list_fmt.list_level   # ListLevel applied to the current paragraph

if list_level.number_style == aw.NumberStyle.ARABIC:
    pass  # value is an integer (1, 2, 3 …)
elif list_level.number_style == aw.NumberStyle.LOWERCASE_LETTER:
    pass  # value is a letter (a, b, c …)
elif list_level.number_style == aw.NumberStyle.LOWERCASE_ROMAN:
    pass  # value is a Roman numeral (i, ii, iii …)
# etc.

Customising the Output

  • Full hierarchical number (e.g., 1.2.3) – combine the ListLabel.Text of all parent levels:

    full_label = ""
    for i in range(level_index + 1):
        full_label += doc.lists[list_fmt.list_id].get_level(i).list_label.text
    
  • Export to CSV / JSON – simply collect the data in a list of dictionaries and then write it with csv or json modules.
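The export step mentioned above can be sketched with the standard library alone. This is a minimal illustration, assuming each list paragraph has already been collected as a dictionary with the keys `level`, `label`, and `text`; those key names, the helper names `to_json` and `to_csv`, and the sample rows are hypothetical and not part of Aspose.Words.

```python
import csv
import io
import json

# Hypothetical sample rows; in practice these would be collected in the
# paragraph loop, one dictionary per list paragraph.
rows = [
    {"level": 1, "label": "1.", "text": "Section 1"},
    {"level": 2, "label": "a.", "text": "Sub-section a"},
    {"level": 3, "label": "i.", "text": "Sub-sub-section i"},
]


def to_json(records):
    """Serialize the collected records as a JSON array."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records, fieldnames=("level", "label", "text")):
    """Serialize the collected records as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_json(rows))
print(to_csv(rows))
```

Writing to a file instead of a string only requires passing an open file object (opened with `newline=""` for CSV) in place of the `io.StringIO` buffer.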


Gotchas & Tips

Situation | What to Do
Skipped paragraphs – some lines may be inside a table cell or header/footer. | Use paragraph.is_inside_a_table or iterate over sections (doc.sections) as needed.
Different list definitions in the same document – multiple List objects. | The code above automatically uses the ListId of each paragraph, so each list is handled correctly.
List continuation after a page break – numbering continues automatically; the same ListId is reused, so you’ll still obtain the correct label. | No extra handling required.

If you have any further requirements—e.g., exporting the hierarchy to JSON, handling nested tables, or customizing the number format—just let us know, and we’ll provide a tailored snippet!

This code raises the following error:

if not list_fmt.is_list:
^^^^^^^^^^^^^^^^
AttributeError: 'aspose.words.lists.ListFormat' object has no attribute 'is_list'

@ln22 You can get the required information using the ListFormat, ListLevel, and ListLabel properties. For example, see the following code:

doc = aw.Document("C:\\Temp\\in.docx")

# Call update_list_labels() first if you need the actual (displayed) list item labels.
doc.update_list_labels()

for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    p = p.as_paragraph()
    if p.is_list_item:
        print(f"Label string: {p.list_label.label_string}") # Actual (displayed) list label string.
        print(f"Label value: {p.list_label.label_value}") # List label numeric value
        print(f"Level: {p.list_format.list_level_number}") # Level of the list item.
        print("------------------------------")
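To get a purely numeric path such as 1.1.1 instead of 1.a.i, the per-item level and label value printed above can be combined with a small counter list. The following is only a sketch: the function name `hierarchical_numbers` is hypothetical, it assumes all items belong to a single list, and it assumes a skipped level should be counted as 1. It works on plain `(level, label_value)` pairs, so it can be tried without a document.

```python
def hierarchical_numbers(items):
    """Yield a dotted numeric path for each (level, label_value) pair.

    `level` is the 0-based list level and `label_value` is the numeric value
    of the item at that level, in document order.
    """
    counters = []  # counters[i] is the current value at level i
    for level, value in items:
        # Deeper levels restart whenever a shallower item appears.
        del counters[level + 1:]
        # Assumption: a level that was skipped is counted as 1.
        while len(counters) <= level:
            counters.append(1)
        counters[level] = value
        yield ".".join(str(c) for c in counters)


# Hypothetical sample: 1, a, i, ii, b, 2, a
sample = [(0, 1), (1, 1), (2, 1), (2, 2), (1, 2), (0, 2), (1, 1)]
for path in hierarchical_numbers(sample):
    print(path)
```

With the loop above, the pairs would come from `p.list_format.list_level_number` and `p.list_label.label_value` for every paragraph where `p.is_list_item` is true. A document with several independent lists would need one counter list per list.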

Aspose list levels with non list paragraphs.docx (15.4 KB)

@alexey.noskov
Can you please help me extend the code you shared above by showing how to find the label value and level of paragraphs that are not list items themselves but sit at the same level as a specific list item? For example, in the document I attached, I would want to know that the text “Text within the level 1.1 a) level but not specifically a list value” has the same label value and level value as a), that is, label value: 1 and level value: 1.

@ln22 The mentioned paragraphs do not belong to the list items, so there is no direct way to determine that they are “on the same level”. Achieving this requires implementing custom logic.

Could you take a swing at this custom logic, or share your thoughts on how you would use Aspose to implement it?

@ln22 You can treat the paragraphs that come after a list item, but are not list items themselves, as paragraphs that belong to that list item's section.
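One way to sketch this idea is to walk the paragraphs in document order, remember the most recent list item, and let every following non-list paragraph inherit its label and level. The code below is a minimal illustration rather than an Aspose-provided feature: `ParagraphInfo`, `Row`, and `inherit_list_context` are hypothetical names, and the sample data is made up so that the logic can be run without a document.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class ParagraphInfo(NamedTuple):
    """Plain record of the values read from one paragraph."""
    text: str
    is_list_item: bool
    label_string: Optional[str] = None
    label_value: Optional[int] = None
    level: Optional[int] = None


class Row(NamedTuple):
    text: str
    label_string: Optional[str]
    label_value: Optional[int]
    level: Optional[int]
    inherited: bool  # True when the values come from a preceding list item


def inherit_list_context(paragraphs: Iterable[ParagraphInfo]) -> Iterator[Row]:
    """Give non-list paragraphs the label and level of the last list item."""
    current = None  # most recent list item seen so far
    for p in paragraphs:
        if p.is_list_item:
            current = p
            yield Row(p.text, p.label_string, p.label_value, p.level, False)
        elif current is None:
            # Text before the first list item has no list context.
            yield Row(p.text, None, None, None, False)
        else:
            yield Row(p.text, current.label_string, current.label_value,
                      current.level, True)


# Hypothetical sample data standing in for values read from a document.
sample = [
    ParagraphInfo("Introduction", False),
    ParagraphInfo("First item", True, "a)", 1, 1),
    ParagraphInfo("Text under a) that is not a list item", False),
    ParagraphInfo("Second item", True, "b)", 2, 1),
]
for row in inherit_list_context(sample):
    print(row)
```

To use it with a document, each `ParagraphInfo` could be filled from the properties shown earlier in the thread (`p.is_list_item`, `p.list_label.label_string`, `p.list_label.label_value`, `p.list_format.list_level_number`) after calling `doc.update_list_labels()`. Whether empty paragraphs, headings, or table content should end the inherited section is a judgment call that depends on the document.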