The Word VBA method ActiveDocument.ConvertNumbersToText converts all auto-numbering in a document to text. For example, when a Word document has an auto-numbered list (1, 2, 3 or a, b, c or i, ii, iii), the list labels are not actual text and do not exist as text within the Word XML files. They seem to be rendered by the Word application when a user opens the document.
I am looking to convert all these auto-numbered lists to text and then resave the Word document, similar to what the VBA script does.
Does anyone know how I could accomplish this using the Aspose suite with Python?
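For context on why the labels are absent from the stored text: a list label is derived at render time from a counter and a number format. Below is a minimal sketch of that derivation, assuming Word's usual convention that letters repeat after z (so item 29 is "cc"). The function names are illustrative and not part of any library.

```python
def to_letters(n):
    """Letter label in Word's usual style: a..z, then aa, bb, cc, ..."""
    if n < 1:
        raise ValueError("counter must be >= 1")
    index = (n - 1) % 26          # which letter of the alphabet
    repeats = (n - 1) // 26 + 1   # how many times that letter is repeated
    return chr(ord("a") + index) * repeats


def to_roman(n):
    """Lowercase Roman numeral for 1..3999."""
    if not 1 <= n <= 3999:
        raise ValueError("counter must be in 1..3999")
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def render_label(counter, style, upper=False):
    """Render a counter as '(label)' for style 'decimal', 'letter', or 'roman'."""
    formatters = {"decimal": str, "letter": to_letters, "roman": to_roman}
    text = formatters[style](counter)
    return "({})".format(text.upper() if upper else text)
```

Here render_label(29, "letter", upper=True) gives "(CC)", the kind of label that appears later in this thread.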
@ln22 You can use the following code to convert list labels to regular text:
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
# Update list labels.
doc.update_list_labels()
# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item:
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
doc.save("C:\\Temp\\out.docx")
@ln22 You can modify the code like this to preserve indents:
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
# Update list labels.
doc.update_list_labels()
# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item:
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        # Read the list indent before the numbering is removed.
        indent = para.list_format.list_level.number_position
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = indent
doc.save("C:\\Temp\\out.docx")
This code works great except when there is a list label inside another list label. Is it possible to turn the second label into text as well? For example, the code pulls out the text of (CC) correctly but then keeps (1) as a list item, which I cannot pull the text out of.
When I run para.get_text() on the paragraph, I get the following output, showing that the (1) LISTNUM field exists there.
'\x13 LISTNUM "zzmpLDNBasic||LDN Basic|2|1|1|1|0|1||1|0|32||1|0|0||1|0|0||1|0|0||1|0|0||1|0|0||mpNA||mpNA||" \l 5 \s \x15\tfirstly, if the'
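The output above can be picked apart with plain string handling. Here is a minimal sketch, assuming the paragraph text uses the usual field control characters (\x13 for field start, \x14 for the separator, \x15 for field end) and that the LISTNUM field is the first thing in the paragraph. The helper split_leading_listnum is hypothetical, not an Aspose API.

```python
import re

# A leading field: start char, field code, optional separator + result, end char, then the rest.
_LEADING_FIELD = re.compile(
    r"^\x13(?P<code>[^\x13\x14\x15]*)"
    r"(?:\x14(?P<result>[^\x13\x15]*))?"
    r"\x15(?P<rest>.*)$",
    re.DOTALL,
)
# The \l switch of a LISTNUM field carries the list level.
_LEVEL_SWITCH = re.compile(r"\\l\s+(\d+)")


def split_leading_listnum(text):
    """Return (level, result, rest) if text starts with a LISTNUM field, else None.

    level is the integer after the \\l switch (None if absent), result is the
    stored field result (None if the field has no separator), and rest is the
    paragraph text that follows the field.
    """
    match = _LEADING_FIELD.match(text)
    if match is None:
        return None
    code = match.group("code").strip()
    if not code.upper().startswith("LISTNUM"):
        return None
    level_match = _LEVEL_SWITCH.search(code)
    level = int(level_match.group(1)) if level_match else None
    return level, match.group("result"), match.group("rest")
```

This only detects the field and reads its \l level switch. It does not compute the rendered label such as (1), which depends on the list definition and the running counter, and it does not handle nested fields.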
@ln22 Could you please attach your input document here for testing and provide the expected output? We will check the issue and provide you more information.
I am looking to extract (1) within (A). I can see (A) in para.list_label.label_string, but I am not sure how to also extract (1) as a string within that paragraph.
I find this leads to the first list label resetting. For example, if I have (C) (1), after running p.range.unlink_fields() it turns into (A) (1), which is not what I want. I need the list labels to stay intact if I review this document at a later date.
import aspose.words as aw

# Load the document into memory with Aspose.Words (word_doc_bytes is defined elsewhere).
aspose_doc = aw.Document(word_doc_bytes)
# Update list labels in the Aspose Document
aspose_doc.update_list_labels()
# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in aspose_doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item and para.list_label.label_string != "":
        current_para_left_indent = para.paragraph_format.left_indent
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(aspose_doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = current_para_left_indent
        para.range.unlink_fields()
    else:
        # Resaving the left_indent to standardize file indents
        current_para_left_indent = para.paragraph_format.left_indent
        para.paragraph_format.left_indent = current_para_left_indent
@ln22 Please unlink fields after replacing list labels with simple text:
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
doc.update_list_labels()
# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item and para.list_label.label_string != "":
        current_para_left_indent = para.paragraph_format.left_indent
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = current_para_left_indent
    else:
        # Resaving the left_indent to standardize file indents
        current_para_left_indent = para.paragraph_format.left_indent
        para.paragraph_format.left_indent = current_para_left_indent
# Unlink fields only after all list labels have been replaced with plain text.
doc.unlink_fields()
doc.save("C:\\Temp\\out.docx")
While this code seems to work for the example document I provided, for the documents I am reviewing it intermittently fails and produces incorrect second list labels. For example, in one document it changed “(a) (i)” to “(a) (a)”. Also, the .unlink_fields() method seems to remove the table of contents link metadata from the documents. Is there another way to accomplish what I am looking for without using unlink_fields()?
@ln22 If possible, please attach the problematic input document so we can test the scenario on our side. Generally, the problem occurs because the code removes list numbering from the paragraph while LISTNUM fields are used in the document: when list numbering is removed, the LISTNUM field is updated accordingly. Please try using the following code:
import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
doc.update_list_labels()
doc.update_fields()
# Lock list fields
for f in doc.range.fields:
    if f.type == aw.fields.FieldType.FIELD_LIST_NUM:
        f.is_locked = True
# Convert list items into regular paragraphs with leading text that imitates numbering.
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    para = p.as_paragraph()
    if para.is_list_item and para.list_label.label_string != "":
        current_para_left_indent = para.paragraph_format.left_indent
        label = para.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        para.list_format.remove_numbers()
        para.prepend_child(fake_list_label_run)
        para.paragraph_format.left_indent = current_para_left_indent
    else:
        # Resaving the left_indent to standardize file indents
        current_para_left_indent = para.paragraph_format.left_indent
        para.paragraph_format.left_indent = current_para_left_indent
doc.unlink_fields()
doc.save("C:\\Temp\\out.docx")
This code also does not work. The method .unlink_fields() removes internal _bookmark links from the table of contents to the document. I need those links so I cannot use that method.
When a list paragraph contains a second label as a LISTNUM field, I see that the LISTNUM is always in the first run. For example, when the list starts with (CC) (1), the (1) is within para.runs[0]. I see that para.runs[0].text is equal to:
' LISTNUM "zzmpLDNBasic||LDN Basic|2|1|1|1|0|1||1|0|32||1|0|0||1|0|0||1|0|0||1|0|0||1|0|0||mpNA||mpNA||" \l 5 \s '
Is it possible to convert this LISTNUM from the above text to '(1)' inline? You could detect when para.runs[0] is a LISTNUM and convert it to the text that it represents.
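One direction that might be worth trying is to unlink only the LISTNUM fields instead of calling unlink_fields() on the whole document, so that table of contents fields and their bookmark links are left alone. The sketch below is untested against the documents discussed here. It assumes each field object exposes a type attribute and an unlink() method that replaces the field with its most recent result, and that the fields are reachable through doc.range.fields as in the code earlier in this thread. The helper name unlink_fields_of_type is hypothetical.

```python
def unlink_fields_of_type(doc, field_type):
    """Unlink every field whose type equals field_type; leave all other fields untouched.

    Returns the number of fields that were unlinked.
    """
    # Take a snapshot first: unlinking changes the document, so avoid
    # modifying the field collection while iterating over it.
    targets = [field for field in doc.range.fields if field.type == field_type]
    for field in targets:
        field.unlink()
    return len(targets)


# Assumed usage with Aspose.Words for Python (not verified here):
#   doc.update_list_labels()
#   doc.update_fields()
#   unlink_fields_of_type(doc, aw.fields.FieldType.FIELD_LIST_NUM)
#   ... then replace the paragraph list labels with text as in the code above ...
```

Running this before remove_numbers() is a guess based on the explanation that removing list numbering causes LISTNUM fields to update. Whether the unlinked text comes out as (1) depends on the field result being current at that point.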