Node to HTML strings taking long time in conversion

anshuman.tiwari1 · October 11, 2024, 5:05pm

Hi Team,

We are trying to convert nodes to HTML strings for a .docx document using Aspose.Words for Python, but each node takes about 500 ms to convert on my machine. For 10K nodes, that adds up to roughly 5,000 seconds (about 83 minutes) in total.

Please suggest the optimal way to do this conversion so that the processing time is reduced.

I have attached a sample 1 MB document and minimal reproducible code. Please help.

Minimal reproducible code:

import time
import aspose.words as aw

# Load license
license = aw.License()
license.set_license("Aspose Total Product Family license")

# Load the document
doc = aw.Document("<LOCAL PATH FOR 1mb.docx>")

# Set options
options = aw.saving.HtmlSaveOptions()
options.export_list_labels = aw.saving.ExportListLabels.BY_HTML_TAGS
options.export_original_url_for_linked_images = True
options.export_images_as_base64 = True

# Convert each paragraph to HTML and print the time taken per node, in seconds
for paragraph_object in doc.get_child_nodes(aw.NodeType.PARAGRAPH, is_deep=True):
    start = time.perf_counter()
    html_string = paragraph_object.to_string(options)
    print(time.perf_counter() - start)
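
The loop above prints one timing per node, which is hard to compare across runs once there are thousands of nodes. As a rough sketch (not part of Aspose.Words; the helper names `time_conversions` and `summarize` are hypothetical), the per-call timings can be collected and summarized with the standard library, so the original approach and any workaround can be compared on the same document:

```python
import time
from statistics import mean


def time_conversions(items, convert):
    """Call convert(item) for each item and time every call.

    Returns (results, durations), where durations are in seconds.
    """
    results, durations = [], []
    for item in items:
        start = time.perf_counter()
        results.append(convert(item))
        durations.append(time.perf_counter() - start)
    return results, durations


def summarize(durations):
    """Return count, total, mean and worst per-call time in seconds."""
    if not durations:
        return {"count": 0, "total": 0.0, "mean": 0.0, "max": 0.0}
    return {
        "count": len(durations),
        "total": sum(durations),
        "mean": mean(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload (assumption): str.upper replaces the real conversion
    # so that the sketch runs without Aspose.Words installed.
    results, durations = time_conversions(["a", "b", "c"], str.upper)
    print(summarize(durations))
```

With the reproducer above, the call would presumably look like `time_conversions(doc.get_child_nodes(aw.NodeType.PARAGRAPH, True), lambda p: p.to_string(options))`.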

1mb.docx (1.00 MB)

alexey.noskov · October 12, 2024, 5:49am

@anshuman.tiwari1
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27476

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

As a temporary workaround, you can import each node into a temporary document and then export the whole temporary document to HTML:

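import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.txt")
for p in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    # Shallow-clone the source document to get an empty temporary document.
    tmp = doc.clone(False).as_document()
    tmp.ensure_minimum()
    tmp.first_section.body.remove_all_children()
    # Import the paragraph into the temporary document, then export that document to HTML.
    tmp.first_section.body.append_child(tmp.import_node(p, True, aw.ImportFormatMode.USE_DESTINATION_STYLES))
    html_string = tmp.to_string(aw.SaveFormat.HTML)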

anshuman.tiwari1 · October 15, 2024, 9:33am

Thanks @alexey.noskov for the prompt reply.

This works for now, but we are facing an issue with list items.

If the paragraph is a list item, the HTML string does not preserve the list label.

What changes should I make to preserve it?

alexey.noskov · October 15, 2024, 12:27pm

@anshuman.tiwari1 The only way I can suggest with the proposed workaround is to convert list labels to plain text before extracting HTML node by node:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
# Update list labels so that list_label.label_string is populated.
doc.update_list_labels()

# Convert list items into regular paragraphs with leading text that imitates numbering.
for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
    p = node.as_paragraph()
    if p.is_list_item:
        label = p.list_label.label_string + "\t"
        fake_list_label_run = aw.Run(doc, label)
        indent = p.list_format.list_level.number_position
        p.list_format.remove_numbers()
        p.prepend_child(fake_list_label_run)
        p.paragraph_format.left_indent = indent

    # Export the paragraph through a temporary document, as in the workaround above.
    tmp = doc.clone(False).as_document()
    tmp.ensure_minimum()
    tmp.first_section.body.remove_all_children()
    tmp.first_section.body.append_child(tmp.import_node(p, True, aw.ImportFormatMode.USE_DESTINATION_STYLES))
    html_string = tmp.to_string(aw.SaveFormat.HTML)
    print(html_string)
    print("===========================")