Converting Word Lists to Text

ln22 · April 24, 2024, 3:42pm

Hello,

I am trying to mimic the following VBA code using Aspose:
ConvertNumbersToText Method | Microsoft Learn

This Word VBA method (ActiveDocument.ConvertNumbersToText) converts all auto-numbering in a document to text. For example, when you have an auto-numbered list in Word (1,2,3 or a,b,c or i,ii,iii), the list labels are not actually text and do not exist as text within the Word XML files. They seem to be rendered by Word itself when a user opens the document.
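To illustrate why the labels are absent from the document text: a .docx stores only a numbering definition (in OOXML, a number format such as `decimal` or `lowerLetter` plus a level text pattern such as `(%1)`), and the label is computed from a running counter at display time. The following is a minimal pure-Python sketch of that idea for a single list level. The function names are hypothetical and this is not how Word or Aspose.Words is implemented internally.

```python
def to_roman(n: int) -> str:
    """Format a positive integer as a lowercase Roman numeral."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def to_letter(n: int) -> str:
    """Format 1..26 as a..z, then repeat the letter: 27 -> aa, 28 -> bb."""
    letter = chr(ord("a") + (n - 1) % 26)
    return letter * ((n - 1) // 26 + 1)


# Hypothetical lookup keyed by OOXML-style number format names.
FORMATTERS = {
    "decimal": str,
    "lowerLetter": to_letter,
    "upperLetter": lambda n: to_letter(n).upper(),
    "lowerRoman": to_roman,
}


def render_label(counter: int, num_fmt: str, lvl_text: str) -> str:
    """Build a single-level label by substituting the formatted counter for %1."""
    return lvl_text.replace("%1", FORMATTERS[num_fmt](counter))


print(render_label(3, "decimal", "%1."))        # 3.
print(render_label(29, "upperLetter", "(%1)"))  # (CC)
```

Converting numbering to text amounts to evaluating this once per paragraph and storing the result as ordinary characters.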

I am looking to convert all these auto-numbered lists to text and then resave the Word document, similar to what the VBA script does.

Does anyone know how I could accomplish this using the Aspose suite with Python?

Regards,
SM

alexey.noskov · April 24, 2024, 6:36pm

@ln22 You can use the following code to convert list labels to regular text:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

# Update list labels.
doc.update_list_labels()

# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item:
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)

doc.save("C:\\Temp\\out.docx")

ln22 · April 24, 2024, 9:38pm

Hello,

This code worked great. I have an Aspose.Total for .NET license. Can I also use Aspose.Total for Python via .NET with this same license?

Regards,
SM

alexey.noskov · April 25, 2024, 4:39am

@ln22 No, Aspose.Words for .NET and Aspose.Words for Python via .NET are different products and require different licenses.

ln22 · April 30, 2024, 4:50pm

Hello Alexey,

This removes all left indentation for all lists. Is there a way to fix this code to keep indentation intact?

Regards,
SM

alexey.noskov · April 30, 2024, 6:27pm

@ln22 You can modify the code like this to preserve indents:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

# Update list labels.
doc.update_list_labels()

# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item:
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        # Read the label position before the list formatting is removed.
        indent = para.list_format.list_level.number_position
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = indent

doc.save("C:\\Temp\\out.docx")

ln22 · August 1, 2024, 5:16pm

Hello,

This code works great except when there is a list label inside another list label. Is it possible to turn the second label into text as well? For example, the code extracts the text of (CC) correctly but leaves (1) as a list item, which I then cannot extract as text.

When I run para.get_text() on the paragraph, I get the following output, showing that the (1) LISTNUM field exists there:
'\x13 LISTNUM "zzmpLDNBasic||LDN Basic|2|1|1|1|0|1||1|0|32||1|0|0||1|0|0||1|0|0||1|0|0||1|0|0||mpNA||mpNA||" \l 5 \s \x15\tfirstly, if the'
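For readers unfamiliar with the control characters in that output: `\x13`, `\x14` and `\x15` are the field start, field separator and field end markers. The text between start and separator is the field code, and the text between separator and end is the field result. The LISTNUM field above has no separator and therefore no stored result, which is why "(1)" does not appear anywhere in the string. The following pure-Python sketch (the function name `visible_text` is hypothetical and it does not call Aspose.Words) separates the visible text from the field codes:

```python
FIELD_START = "\x13"
FIELD_SEPARATOR = "\x14"
FIELD_END = "\x15"


def visible_text(raw: str) -> str:
    """Drop field codes and keep field results plus ordinary text."""
    out = []
    # One flag per open field: True while we are inside that field's code part.
    in_code = []
    for ch in raw:
        if ch == FIELD_START:
            in_code.append(True)
        elif ch == FIELD_SEPARATOR:
            if in_code:
                in_code[-1] = False  # the result part of this field begins
        elif ch == FIELD_END:
            if in_code:
                in_code.pop()
        elif not any(in_code):
            out.append(ch)
    return "".join(out)


sample = '\x13 LISTNUM "zzmpLDNBasic||LDN Basic|2|1" \\l 5 \\s \x15\tfirstly, if the'
print(repr(visible_text(sample)))  # '\tfirstly, if the'
```

A field that does carry a result, such as `'\x13 PAGE \x141\x15'`, would yield `'1'` with the same function.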

alexey.noskov · August 1, 2024, 6:37pm

@ln22 Could you please attach your input document here for testing and provide the expected output? We will check the issue and provide you more information.

ln22 · August 1, 2024, 6:59pm

@alexey.noskov

Testing_doc_4_aspose.docx (20.7 KB)

I am looking to extract the (1) within (A). I can see (A) in para.list_label.label_string, but I am not sure how to also extract (1) as a string from that paragraph.

ln22 · August 1, 2024, 7:03pm

@alexey.noskov

Testing_doc_4_aspose_fixed.docx (20.5 KB)

One solution could be to move (1) down onto its own line, indented to its correct position, as shown in this document. I am not sure whether this can be accomplished, though.

alexey.noskov · August 2, 2024, 4:14am

@ln22 The second number is represented by a LISTNUM field. You can get the value of this field using the following code (Java):

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
Paragraph p = doc.getFirstSection().getBody().getParagraphs().get(3);
p.getRange().unlinkFields();
System.out.println(p.toString(SaveFormat.TEXT));

ln22 · August 5, 2024, 2:29pm

Would that code be equivalent to p.range.unlink_fields() in Python?

alexey.noskov · August 5, 2024, 2:54pm

@ln22 Yes, here is the code in Python:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
doc.update_list_labels()
p = doc.first_section.body.paragraphs[4]
p.range.unlink_fields()
print(p.to_string(aw.SaveFormat.TEXT))

ln22 · August 5, 2024, 3:02pm

I find this leads to the first list label resetting. For example, if I have (C) (1), after running p.range.unlink_fields() it turns into (A) (1), which is not what I want. I need the list label to stay intact in case I review this document at a later date.

alexey.noskov · August 5, 2024, 6:38pm

@ln22 Could you please attach your input document here for testing? I do not see this problem with the document you have attached earlier.

ln22 · August 5, 2024, 7:41pm

I use the following code.

    # Load the document into memory with Aspose.Words
    aspose_doc = aw.Document(word_doc_bytes)

    # Update list labels in the Aspose Document
    aspose_doc.update_list_labels()

    # Convert list items into regular paragraphs with leading text that imitates numbering.
    for p in aspose_doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        para = p.as_paragraph()
        if para.is_list_item and para.list_label.label_string != "":
            current_para_left_indent = para.paragraph_format.left_indent
            label = para.list_label.label_string + "\t"
            fake_list_label_run = aw.Run(aspose_doc, label)
            para.list_format.remove_numbers()
            para.prepend_child(fake_list_label_run)
            para.paragraph_format.left_indent = current_para_left_indent
            para.range.unlink_fields()
        else:
            # Resaving the left_indent to standardize file indents
            current_para_left_indent = para.paragraph_format.left_indent
            para.paragraph_format.left_indent = current_para_left_indent

My original file is:
Testing_doc_4_aspose_before_code.docx (23.1 KB)

My file after running the code and saving the word file again is:
Testing_doc_4_aspose_after_code.docx (17.7 KB)

As you can see, (D) (E) (F) turn into (B) (C) (D). I need them to stay unchanged while also exposing (1) as a string rather than an auto-numbered list item.

alexey.noskov · August 6, 2024, 4:40am

@ln22 Please unlink fields after replacing list labels with simple text:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
doc.update_list_labels()
# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item and para.list_label.label_string != "":
        current_para_left_indent = para.paragraph_format.left_indent
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = current_para_left_indent
    else:
        # Resaving the left_indent to standardize file indents
        current_para_left_indent = para.paragraph_format.left_indent
        para.paragraph_format.left_indent = current_para_left_indent

doc.unlink_fields()

doc.save("C:\\Temp\\out.docx")

out.docx (17.7 KB)

ln22 · August 6, 2024, 9:09pm

While this code seems to work for the example document I provided, I find that for the documents I am reviewing it intermittently fails and produces incorrect second list labels. For example, in one document it changed "(a) (i)" to "(a) (a)". Also, the .unlink_fields() method seems to remove the table of contents link metadata from the documents. Is there another way to accomplish what I am looking for without using unlink_fields()?

alexey.noskov · August 7, 2024, 4:13am

@ln22 If possible, please attach the problematic input document so we can test the scenario on our side. Generally, the problem occurs because the code removes list numbering from the paragraph while LISTNUM fields are used in the document. When list numbering is removed, the LISTNUM field is updated accordingly. Please try using the following code:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
doc.update_list_labels()

doc.update_fields()
# Lock list fields
for f in doc.range.fields:
    if f.type == aw.fields.FieldType.FIELD_LIST_NUM:
        f.is_locked = True

# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item and para.list_label.label_string != "":
        current_para_left_indent = para.paragraph_format.left_indent
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = current_para_left_indent
    else:
        # Resaving the left_indent to standardize file indents
        current_para_left_indent = para.paragraph_format.left_indent
        para.paragraph_format.left_indent = current_para_left_indent

doc.unlink_fields()

doc.save("C:\\Temp\\out.docx")

ln22 · August 7, 2024, 1:48pm

This code does not work either. The .unlink_fields() method removes the internal _bookmark links from the table of contents to the document. I need those links, so I cannot use that method.

When a second LISTNUM is inside a list paragraph, I see that the LISTNUM is always the first run. For example, when the list starts with (CC) (1), the (1) is in para.runs[0]. I see that para.runs[0].text is equal to:
' LISTNUM "zzmpLDNBasic||LDN Basic|2|1|1|1|0|1||1|0|32||1|0|0||1|0|0||1|0|0||1|0|0||1|0|0||mpNA||mpNA||" \l 5 \s '

Is it possible to convert this LISTNUM from the above text to '(1)' inline? You could detect when para.runs[0] is a LISTNUM and convert it to the text that it represents.
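As a starting point for detecting such a run, the field code text can be parsed with plain Python. The sketch below is not an Aspose.Words API and the names `ListNumCode` and `parse_listnum` are hypothetical. It relies on Word's documented LISTNUM syntax, where the quoted argument is the list name, `\l` gives the list level and `\s` gives the start-at value. Note that the field code alone does not contain the rendered label: turning level 5 into "(1)" still requires the number format of that level from the list definition and the running count of items, which this sketch does not attempt.

```python
import re
from typing import NamedTuple, Optional


class ListNumCode(NamedTuple):
    name: Optional[str]      # quoted list name, if present
    level: Optional[int]     # value of the \l switch, if present
    start_at: Optional[int]  # value of the \s switch, if it carries a number


def parse_listnum(code: str) -> Optional[ListNumCode]:
    """Return the parsed parts of a LISTNUM field code, or None for other text."""
    # Tolerate surrounding whitespace, field markers and typographic quotes.
    text = code.strip(" \t\x13\x14\x15").replace("\u201c", '"').replace("\u201d", '"')
    if not text.upper().startswith("LISTNUM"):
        return None
    name = re.search(r'"([^"]*)"', text)
    # Remove the quoted name so digits and pipes inside it cannot match a switch.
    rest = re.sub(r'"[^"]*"', "", text)
    level = re.search(r"\\l\s+(\d+)", rest)
    start = re.search(r"\\s\s+(\d+)", rest)
    return ListNumCode(
        name.group(1) if name else None,
        int(level.group(1)) if level else None,
        int(start.group(1)) if start else None,
    )


run_text = ' LISTNUM "zzmpLDNBasic||LDN Basic|2|1" \\l 5 \\s '
parsed = parse_listnum(run_text)
print(parsed.level)     # 5
print(parsed.start_at)  # None
```

With this, a check such as `parse_listnum(para.runs[0].text) is not None` could flag paragraphs whose first run holds a LISTNUM field code, assuming the run text looks like the sample above.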