Extract Subsection List Text Meta Data

ln22 · September 9, 2025, 10:02pm

How can I extract metadata about the level of automatically numbered (multi-level) list text in a Word .docx using the Aspose.Words Python library? I would like output that maps labels like 1.a.i to numeric level information like 1.1.1, so that I know numerically which level the current text is at.

Aspose list levels.docx (15.1 KB)

Professionalize.Discourse · September 9, 2025, 10:03pm

@ln22

Extracting List‑Level Information from a DOCX with Aspose.Words for Python via .NET

Aspose.Words can read the hierarchical numbering of a multi‑level list (e.g., 1.a.i) and also tell you the numeric level (0‑based).
Below is a complete, ready‑to‑run example that:

Loads the Word document.
Walks through every paragraph.
Detects whether the paragraph belongs to a list.
Retrieves the displayed label (1., a., i. …) and the list level (0, 1, 2 …).
Prints the label together with the paragraph text.

# --------------------------------------------------------------
#   Aspose.Words for Python via .NET – List level extraction
# --------------------------------------------------------------

import aspose.words as aw

# ------------------------------------------------------------------
# 1. Load the DOCX (replace the path with your own file location)
# ------------------------------------------------------------------
doc = aw.Document("Aspose list levels.docx")

# ------------------------------------------------------------------
# 2. Iterate over all paragraphs in the document
# ------------------------------------------------------------------
for para in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    paragraph = para.as_paragraph()

    # ------------------------------------------------------------------
    # 3. Check whether the paragraph is part of a list
    # ------------------------------------------------------------------
    list_fmt = paragraph.list_format
    if not list_fmt.is_list:
        continue                      # skip non‑list paragraphs

    # ------------------------------------------------------------------
    # 4. Get the label that Word shows (e.g. "1.", "a.", "i.")
    # ------------------------------------------------------------------
    label = list_fmt.list_label       # type: aw.ListLabel
    displayed_label = label.text      # already includes the trailing dot

    # ------------------------------------------------------------------
    # 5. Get the numeric level (0 = top level, 1 = second level, …)
    # ------------------------------------------------------------------
    level_index = list_fmt.list_level_number   # 0‑based index
    level_number = level_index + 1              # 1‑based for readability

    # ------------------------------------------------------------------
    # 6. Retrieve the plain paragraph text (without the label)
    # ------------------------------------------------------------------
    para_text = paragraph.to_string(aw.SaveFormat.TEXT).strip()

    # ------------------------------------------------------------------
    # 7. Output the information
    # ------------------------------------------------------------------
    print(f"Level {level_number:>2} | Label: {displayed_label:<4} | Text: {para_text}")

# --------------------------------------------------------------
# End of script
# --------------------------------------------------------------

What the script prints

For the sample Aspose list levels.docx the output looks similar to:

Level  1 | Label: 1.   | Text: Section 1
Level  2 | Label: a.   | Text: Sub‑section a
Level  3 | Label: i.   | Text: Sub‑sub‑section i
Level  1 | Label: 2.   | Text: Section 2
Level  2 | Label: b.   | Text: Sub‑section b
...

Level – the hierarchical depth (1 = top level, 2 = second level, …).
Label – exactly what Word displays for that paragraph (1., a., i. etc.).
Text – the actual paragraph content without the numbering.

How It Works Under the Hood

Aspose.Words API	Meaning
`Paragraph.ListFormat.IsList`	Tells whether the paragraph participates in a list.
`Paragraph.ListFormat.ListLabel`	Gives the formatted label (`1.`, `a.`, `i.`) that Word renders.
`Paragraph.ListFormat.ListLevelNumber`	Zero‑based index of the list level (0 = first level).
`Document.Lists[ listId ]`	Gives access to the `List` object if you need further details (e.g., number style).

If you need to know how a level's number is rendered (Arabic digit, letter, Roman numeral), you can inspect ListLevel.NumberStyle; the raw numeric value of the item itself is available as ListLabel.LabelValue:

list_level = paragraph.list_format.list_level

if list_level.number_style == aw.NumberStyle.ARABIC:
    kind = "arabic"    # rendered as 1, 2, 3 …
elif list_level.number_style == aw.NumberStyle.LOWERCASE_LETTER:
    kind = "letter"    # rendered as a, b, c …
elif list_level.number_style == aw.NumberStyle.LOWERCASE_ROMAN:
    kind = "roman"     # rendered as i, ii, iii …
else:
    kind = "other"     # any other number style

Useful Documentation Links

ListLabel – details on ListLabel.Text and related members:
https://docs.aspose.com/words/python-net/listlabel/
ListFormat – how to detect list paragraphs and retrieve level numbers:
https://docs.aspose.com/words/python-net/listformat/
Working with Lists (tutorial) – end‑to‑end examples:
https://docs.aspose.com/words/python-net/working-with-lists/
Saving a Document to Plain Text (used for Paragraph.ToString):
https://docs.aspose.com/words/python-net/save-as-text/

Customising the Output

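Full hierarchical number (e.g., 1.2.3) – keep the most recent label value of each level while iterating and join them with dots:

# Assumes doc.update_list_labels() was called and path = [] was defined before the loop.
path = path[:level_index] + [paragraph.list_label.label_value]
full_label = ".".join(str(n) for n in path)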

Export to CSV / JSON – simply collect the data in a list of dictionaries and then write it with csv or json modules.
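
As a rough illustration of the export idea, the sketch below uses only the standard csv and json modules. The rows are placeholders modelled on the sample output shown earlier, and the helper names rows_to_csv and rows_to_json are hypothetical; in practice the dictionaries would be collected inside the paragraph loop.

```python
import csv
import io
import json

# Placeholder rows; real ones would be appended inside the paragraph loop.
rows = [
    {"level": 1, "label": "1.", "text": "Section 1"},
    {"level": 2, "label": "a.", "text": "Sub-section a"},
    {"level": 3, "label": "i.", "text": "Sub-sub-section i"},
]


def rows_to_csv(rows):
    """Serialize the row dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["level", "label", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows):
    """Serialize the row dictionaries to a JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


csv_text = rows_to_csv(rows)
json_text = rows_to_json(rows)
```

Writing to a file instead only requires passing a file opened with newline="" to csv.DictWriter, or calling json.dump on an open file.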

Gotchas & Tips

Situation	What to Do
Skipped paragraphs – some lines may be inside a table cell or header/footer.	Check `paragraph.is_in_cell` or iterate over sections (`doc.sections`) as needed.
Different list definitions in the same document – multiple `List` objects.	The code above automatically uses the `ListId` of each paragraph, so each list is handled correctly.
List continuation after a page break – numbering continues automatically; the same `ListId` is reused, so you’ll still obtain the correct label.	No extra handling required.

If you have any further requirements—e.g., exporting the hierarchy to JSON, handling nested tables, or customizing the number format—just let us know, and we’ll provide a tailored snippet!

ln22 · September 9, 2025, 10:10pm

Professionalize.Discourse:

This code provides the following error:

    if not list_fmt.is_list:
           ^^^^^^^^^^^^^^^^
AttributeError: 'aspose.words.lists.ListFormat' object has no attribute 'is_list'

alexey.noskov · September 10, 2025, 4:23am

@ln22 You can get the required information using ListFormat, ListLevel and ListLabel properties. For example see the following code:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

# Call update_list_labels() first so that the actual list item labels are available.
doc.update_list_labels()

for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    p = p.as_paragraph()
    if p.is_list_item:
        print(f"Label string: {p.list_label.label_string}") # Actual (displayed) list label string.
        print(f"Label value: {p.list_label.label_value}") # List label numeric value
        print(f"Level: {p.list_format.list_level_number}") # Level of the list item.
        print("------------------------------")

ln22 · September 10, 2025, 3:18pm

Aspose list levels with non list paragraphs.docx (15.4 KB)

@alexey.noskov
Can you please help me extend the code you shared above by showing how to find the label value and level of paragraphs that are not list items themselves but sit on the same level as a specific list item? For example, in the document I attached, I would want to know that the text “Text within the level 1.1 a) level but not specifically a list value” has the same label value and level value as a), which is label value: 1 and level value: 1.

alexey.noskov · September 10, 2025, 8:04pm

@ln22 The mentioned paragraphs do not belong to the list items, so there is no direct way to determine that they are “on the same level”. Achieving this requires custom logic.

ln22 · September 11, 2025, 8:58pm

Could you take a swing at this custom logic, or share your thoughts on how you would use Aspose to implement it?

alexey.noskov · September 12, 2025, 3:40am

@ln22 You can treat the paragraphs that come after a list item, but are not list items themselves, as belonging to that list item's section.
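
One possible reading of that suggestion is sketched below; it is not an official Aspose sample. The function name assign_list_context is hypothetical. It relies only on the paragraph members already used earlier in this thread (is_list_item, list_label.label_string, list_label.label_value, list_format.list_level_number) plus get_text(), and it assumes that every non-list paragraph simply inherits from the closest preceding list item, with nothing (headings, tables, section breaks) resetting that context.

```python
def assign_list_context(paragraphs):
    """Yield one record per non-empty paragraph.

    Paragraphs that are not list items inherit the label and level of the
    closest preceding list item (assumption: nothing resets that context).
    """
    current = None  # (label_string, label_value, level) of the last list item seen
    for p in paragraphs:
        if p.is_list_item:
            current = (
                p.list_label.label_string,
                p.list_label.label_value,
                p.list_format.list_level_number,
            )
        text = p.get_text().strip()
        if not text:
            continue  # skip empty paragraphs
        label_string, label_value, level = current or (None, None, None)
        yield {
            "text": text,
            "is_list_item": bool(p.is_list_item),
            "label_string": label_string,
            "label_value": label_value,
            "level": level,
        }


# Possible usage with Aspose.Words (calls taken from the snippet earlier in the thread):
#   import aspose.words as aw
#   doc = aw.Document("in.docx")
#   doc.update_list_labels()
#   nodes = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
#   for record in assign_list_context(n.as_paragraph() for n in nodes):
#       print(record)
```

Paragraphs that appear before the first list item get None for all three inherited fields. If a heading or a table should end the inherited context, that rule would need to be added where current is updated.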