Numbering formatting is ignored when converting HTML to Word

omri-1 · February 28, 2019, 3:41pm

Hi,

If we take a Word document with numbering formatting (that is, Word “knows” the content is a list and adds the next number when a new line is added), convert it to HTML, and then convert it back to Word, the numbering formatting is lost.

This forces us to re-enter the formatting manually.

Can you please fix this issue?

tahir.manzoor · February 28, 2019, 4:36pm

@omri-1

Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

Your input Word document.
Please attach the output Word file that shows the undesired behavior.
Please attach the expected output Word file that shows the desired behavior.
Please create a standalone console application ( source code without compilation errors ) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you get these pieces of information ready, we will start investigation into your issue and provide you more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

omri-1 · March 12, 2019, 11:44am

I’ve tried to simplify it as much as possible.
Here is the main code:

        var doc = new Aspose.Words.Document(@"E:\WordTest\26.docx");
        var sb = new StringBuilder();
        foreach (Aspose.Words.Section section in doc.Sections)
        {
            GetHtml(section.Body, sb);
        }

        File.WriteAllText(@"E:\WordTest\26.htm",
            sb.ToString());

        new Aspose.Words.Document(@"E:\WordTest\26.htm")
            .Save(@"E:\WordTest\26_2.docx", SaveFormat.Docx);

and here is the GetHtml code:

    private static void GetHtml(CompositeNode parentNode, StringBuilder sbHtml)
    {
        if (parentNode != null)
        {
            foreach (Aspose.Words.Node node in parentNode.ChildNodes)
            {
                if (node.NodeType == NodeType.StructuredDocumentTag)
                {
                    var structuredDocumentTag = (StructuredDocumentTag)node;

                    bool containsSubStructuredDocumentTag = false;
                    foreach (Aspose.Words.Node cNode in structuredDocumentTag.ChildNodes)
                    {
                        if (cNode.NodeType == NodeType.StructuredDocumentTag)
                        {
                            containsSubStructuredDocumentTag = true;
                            break;
                        }
                    }

                    if (containsSubStructuredDocumentTag)
                    {
                        if (node is CompositeNode)
                        {
                            if (((CompositeNode)node).ChildNodes != null
                                && ((CompositeNode)node).ChildNodes.Count > 0)
                            {
                                GetHtml(node as CompositeNode, sbHtml);
                            }
                        }
                    }
                    else
                    {
                        string value = null;

                        if ((structuredDocumentTag.IsShowingPlaceholderText == false)
                            || (structuredDocumentTag.SdtType == SdtType.Date))
                        {
                            switch (structuredDocumentTag.SdtType)
                            {
                                case SdtType.Checkbox:
                                    value = structuredDocumentTag.Checked.ToString();
                                    break;
                                case SdtType.Date:
                                    if (structuredDocumentTag.FullDate != DateTime.MinValue)
                                    {
                                        try
                                        {
                                            value = Convert.ToDouble(structuredDocumentTag.FullDate.Ticks).ToString();
                                        }
                                        catch (Exception)
                                        {
                                            value = "0";
                                        }
                                    }
                                    else
                                    {
                                        value = "0";
                                    }
                                    break;
                                case SdtType.DropDownList:
                                    {
                                        // SelectedValue is null when nothing is selected; avoid a NullReferenceException.
                                        value = structuredDocumentTag.ListItems.SelectedValue?.Value;
                                    }
                                    break;
                                case SdtType.PlainText:
                                case SdtType.RichText:
                                    {
                                        if (structuredDocumentTag.SdtType == SdtType.RichText)
                                        {
                                            value = ReadAllNodesFromField(structuredDocumentTag, SaveFormat.Html);
                                        }
                                        else
                                        {
                                            value = ReadAllNodesFromField(structuredDocumentTag, SaveFormat.Text);
                                        }
                                    }
                                    break;
                                default:
                                    value = null;
                                    break;
                            }

                        }

                        if (!string.IsNullOrEmpty(value))
                        {
                            sbHtml.AppendLine(value);
                        }
                    }
                }
                else
                {
                    if (node is CompositeNode)
                    {
                        if (((CompositeNode)node).ChildNodes != null
                            && ((CompositeNode)node).ChildNodes.Count > 0)
                        {
                            GetHtml(node as CompositeNode, sbHtml);
                        }
                    }
                }
            }
        }

    }

    private static string ReadAllNodesFromField(StructuredDocumentTag structuredDocumentTag, SaveFormat format)
    {
        string text = string.Empty;

        if (format == SaveFormat.Html)
        {
            var saveOptions = new HtmlSaveOptions
            {
                HtmlVersion = Aspose.Words.Saving.HtmlVersion.Html5,
                ExportImagesAsBase64 = true,
                ExportHeadersFootersMode = Aspose.Words.Saving.ExportHeadersFootersMode.None,
                ExportListLabels = Aspose.Words.Saving.ExportListLabels.AsInlineText
            };

            foreach (Aspose.Words.Node textNode in structuredDocumentTag.ChildNodes)
            {
                text += textNode.ToString(saveOptions);
            }
        }
        else
        {
            foreach (Aspose.Words.Node textNode in structuredDocumentTag.ChildNodes)
            {
                text += textNode.ToString(format);
            }
        }

        return text;
    }

I have attached the original Word file (26.docx).
26.zip (23.1 KB)

As you can see, the original file (26.docx) keeps its numbering formatting, but the new one (26_2.docx) loses it after the conversion to HTML and back.

Thanks!

tahir.manzoor · March 12, 2019, 3:23pm

@omri-1

Thanks for sharing the details. Please set HtmlSaveOptions.ExportListLabels to ExportListLabels.ByHtmlTags in your code to export all list labels as native HTML elements.

var saveOptions = new Aspose.Words.Saving.HtmlSaveOptions
{
    HtmlVersion = Aspose.Words.Saving.HtmlVersion.Html5,
    ExportImagesAsBase64 = true,
    ExportHeadersFootersMode = Aspose.Words.Saving.ExportHeadersFootersMode.None,
    ExportListLabels = Aspose.Words.Saving.ExportListLabels.ByHtmlTags
};

Please note that the DOCX and HTML formats are quite different, so it is sometimes hard to achieve 100% fidelity. If you want to preserve the same list numbers after conversion, please do not export each node to HTML separately. Please use the following code example instead.

// MyDir is the path of the folder that contains 26.docx.
new Aspose.Words.Document(MyDir + @"26.docx").Save(MyDir + @"26.htm");
new Aspose.Words.Document(MyDir + @"26.htm").Save(MyDir + @"26_2.docx", SaveFormat.Docx);

omri-1 · March 12, 2019, 4:28pm

Hi,
We have to export each node since each node represents a different document in our system.

Changing ExportListLabels from “AsInlineText” to “ByHtmlTags” solves this problem but reintroduces an old one.

We set ExportListLabels to “AsInlineText” on your recommendation to solve another problem.

Please provide an option that solves both problems.

Thanks!

tahir.manzoor · March 12, 2019, 6:56pm

@omri-1

Thanks for your inquiry. Aspose.Words exports the nodes to HTML correctly. The input document has nine list items in the content control. Could you please share your expected output HTML for each node here for our reference? We will then provide you more information about your query.

omri-1 · March 14, 2019, 3:07pm

Sure. When using “ByHtmlTags”, you convert the numbered list to ol and li tags.
These tags do not render with the same numbering format in HTML as the list has in MS Word.

By adding the following CSS class to the ol items, you can get the same numbering format in HTML as in MS Word:

<style>
    .my_ol {
        counter-reset: item;
    }

    .my_ol li {
        display: block;
    }

    .my_ol li:before {
        content: counters(item, ".") " ";
        counter-increment: item;
    }
</style>

Attached example (original and fixed HTML):
26_all_fixed.zip (1007 Bytes)

This solves the Word to HTML problem! (assuming you will add this CSS to your product in the next version)

However, this does not solve the HTML to Word problem.
When converting the HTML with the suggested CSS back to Word, it almost works, but not 100%. The only problem (after applying the CSS) is that sub-numbering always starts with 1:

1 aaa
1.1 aaa
2 aaa
1.1 aaa [THIS IS THE PROBLEM, SHOULD BE 2.1]

You can see it by converting the example above.

new Aspose.Words.Document(@"E:\WordTest\26_all_fixed.htm")
                            .Save(@"E:\WordTest\26_all_fixed_2.docx", SaveFormat.Docx);

We leave this one to your experts to solve.

Thanks!

tahir.manzoor · March 14, 2019, 5:47pm

@omri-1

This is the expected behavior of Aspose.Words, not a bug. You can check this behavior by manually creating the HTML for the list items and joining all of the HTML fragments into a single document.

omri-1 · March 14, 2019, 5:59pm

I’m not sure I understand your answer.

As far as we understand, if we convert Word to HTML or HTML to Word and the result does not look and behave the same, this is a bug.

This case contains two bugs:

1 - When we convert Word to HTML, it does not look the same. We found the fix for you (see the CSS in my last post).

2 - When we convert HTML to Word, it does not look the same. We found a partial fix for you (see the CSS in my last post).

tahir.manzoor · March 15, 2019, 3:45am

@omri-1

We have converted the DOCX to HTML and the HTML back to DOCX using the latest version, Aspose.Words for .NET 19.3, with the following code example. We have not found any issue with the output DOCX. We have attached the output DOCX and HTML to this post for your reference.

Docs.zip (11.3 KB)

We suggest you please use Aspose.Words for .NET 19.3.

// MyDir is the path of the folder that contains 26.docx.
new Aspose.Words.Document(MyDir + @"26.docx").Save(MyDir + @"26.htm");
new Aspose.Words.Document(MyDir + @"26.htm").Save(MyDir + @"19.3.docx", SaveFormat.Docx);

omri-1 · March 15, 2019, 7:39am

@tahir.manzoor As I wrote in my previous posts, I can’t use the Document.Save method, since we need only parts of the Document (only the parts that are inside the StructuredDocumentTag).

If Document.Save works, that means there is a bug in Node.ToString, since it should give the same output. Can you please share the code of the Save method, or find the difference between Document.Save and Node.ToString?

Thanks

tahir.manzoor · March 15, 2019, 5:59pm

@omri-1

We have logged your requirement in our issue tracking system as WORDSNET-18318. We will check the possibility of implementation of this feature and update you via this forum thread.

omri-1 · March 15, 2019, 6:16pm

Thank you.
This is not a feature but a bug (the output of Document.Save and the output of Node.ToString must be the same).
Please update us when it is fixed.

tahir.manzoor · March 16, 2019, 5:44am

@omri-1

Please note that Node.ToString does not export list styles. This method exports the HTML correctly.

The Document.Save method and Node.ToString method are two different methods and do not have the same functionality.

Please consider the following list items. You are exporting each list item as separate HTML content. When you export a node to HTML, Node.ToString does not preserve any information about the other list items.

1.1 List item 1
1.2 List item 2

Could you please share your expected HTML output (two separate HTML fragments) for the two list items above? We will then update WORDSNET-18318 in our issue tracking system according to your requirement.

omri-1 · March 17, 2019, 2:37pm

@tahir.manzoor
You are right that if we export just one item in the list, it does not make sense to keep the formatting.
But this is not the case.

If we export a node (Paragraph, StructuredDocumentTag, etc.) that contains the entire list, there should be no difference between this export and saving the entire document.
The fact that Node.ToString does not export list styles is the bug.
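
For readers with the same constraint, here is a rough sketch of one possible workaround. It was not proposed or confirmed by Aspose staff in this thread. The idea is to import the children of a block-level content control into a temporary Document and export that through Document.Save, so the list definitions travel with the content. `SdtHtmlExporter` and `ExportSdtAsHtml` are hypothetical names, and the sketch assumes a block-level StructuredDocumentTag whose children are paragraphs or tables. Note that the result is a complete HTML document, not a fragment.

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Markup;
using Aspose.Words.Saving;

public static class SdtHtmlExporter
{
    // Hypothetical helper, not part of Aspose.Words: exports the content of one
    // block-level content control via Document.Save instead of Node.ToString.
    public static string ExportSdtAsHtml(StructuredDocumentTag sdt, HtmlSaveOptions saveOptions)
    {
        // Assumption: only block-level controls, whose children can live directly in a Body.
        if (sdt.Level != MarkupLevel.Block)
            throw new ArgumentException("Only block-level content controls are handled.", nameof(sdt));

        // A new Document starts with one section and one empty paragraph; drop the paragraph.
        var temp = new Document();
        temp.FirstSection.Body.RemoveAllChildren();

        // NodeImporter copies nodes between documents together with the styles and lists they use.
        var importer = new NodeImporter(sdt.Document, temp, ImportFormatMode.KeepSourceFormatting);
        foreach (Node child in sdt.GetChildNodes(NodeType.Any, false))
        {
            temp.FirstSection.Body.AppendChild(importer.ImportNode(child, true));
        }

        // Save the temporary document to memory and read it back as text.
        using (var stream = new MemoryStream())
        {
            temp.Save(stream, saveOptions);
            stream.Position = 0;
            using (var reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

In the GetHtml code above, this would replace the ReadAllNodesFromField call for SdtType.RichText, with ExportImagesAsBase64 left enabled because the document is saved to a stream. Whether the numbering then survives the trip back to Word would still need to be checked against 26.docx.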

tahir.manzoor · March 17, 2019, 5:46pm

@omri-1

Thanks for sharing the detail. We will inform you via this forum thread once this issue is resolved.

omri-1 · April 23, 2019, 3:43pm

Back to the CSS workaround from my earlier post (#7 in this thread):
This is also not good, since the HTML produced by your Word-to-HTML export includes a “start” attribute (<ol start="3" …) that does not work with the CSS counter.
So please expand WORDSNET-18318 to cover both import and export…
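
Until that is addressed, the `start` attribute can be reconciled with the CSS counter workaround by post-processing the exported HTML. The following is a minimal sketch of that idea, not an Aspose feature: it adds an inline `counter-reset` to every `<ol>` that has a `start` attribute. `OlStartRewriter` and `ApplyStartToCounter` are hypothetical names, the counter name `item` is taken from the CSS posted earlier, and the regular expressions assume double-quoted attribute values with no `>` inside them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OlStartRewriter
{
    // An opening <ol ...> tag with all of its attributes.
    private static readonly Regex OlTag =
        new Regex(@"<ol\b(?<attrs>[^>]*)>", RegexOptions.IgnoreCase);

    // start="N" inside the attribute string.
    private static readonly Regex StartAttr =
        new Regex(@"\sstart\s*=\s*""(?<n>-?\d+)""", RegexOptions.IgnoreCase);

    // The opening of an existing style="..." attribute.
    private static readonly Regex StyleAttr =
        new Regex(@"\sstyle\s*=\s*""", RegexOptions.IgnoreCase);

    // Hypothetical helper: makes a CSS counter honor <ol start="N">.
    public static string ApplyStartToCounter(string html, string counterName = "item")
    {
        return OlTag.Replace(html, tag =>
        {
            string attrs = tag.Groups["attrs"].Value;
            Match start = StartAttr.Match(attrs);
            if (!start.Success)
                return tag.Value; // no start attribute: leave the tag untouched

            // counter-reset sets the value *before* the first counter-increment,
            // so start="3" needs the counter reset to 2.
            int n = int.Parse(start.Groups["n"].Value);
            string decl = $"counter-reset: {counterName} {n - 1};";

            if (StyleAttr.IsMatch(attrs))
                // Prepend to the existing style attribute (first occurrence only).
                attrs = StyleAttr.Replace(attrs, m => m.Value + decl + " ", 1);
            else
                attrs += $" style=\"{decl}\"";

            return "<ol" + attrs + ">";
        });
    }
}
```

The `start` attribute itself is kept, so the list still numbers correctly where the CSS is not applied. This only covers the HTML display side; it does not change how the HTML is imported back into Word.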

tahir.manzoor · April 23, 2019, 5:20pm

@omri-1

Could you please share some more detail about your query? We will then log your requirement in our issue tracking system accordingly.

omri-1 · July 11, 2019, 9:06am

I’ll be happy to help. What additional information do you need?

tahir.manzoor · July 11, 2019, 7:00pm

@omri-1

No further details are needed. We have logged your issue as WORDSNET-18318, which covers exporting the correct styles for list items when using the Node.ToString method.