Numbering formatting is ignored when converting HTML to Word


#1

Hi,

If we take a Word document with numbering formatting (meaning that Word “knows” this is a list, so adding a new line produces a new number), convert it to HTML, and then convert it back to Word, the numbering formatting is lost.

This forces us to re-enter the formatting manually.

Can you please fix this issue?


#2

@omri-1

Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input Word document.
  • The output Word file that shows the undesired behavior.
  • The expected output Word file that shows the desired behavior.
  • A standalone console application (source code without compilation errors) that helps us reproduce the problem on our end.

As soon as you have these resources ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.


#3

I’ve tried to simplify it as much as possible.
Here is the main code:

        var doc = new Aspose.Words.Document(@"E:\WordTest\26.docx");
        var sb = new StringBuilder();
        foreach (Aspose.Words.Section section in doc.Sections)
        {
            GetHtml(section.Body, sb);
        }

        File.WriteAllText(@"E:\WordTest\26.htm",
            sb.ToString());

        new Aspose.Words.Document(@"E:\WordTest\26.htm")
            .Save(@"E:\WordTest\26_2.docx", SaveFormat.Docx);

and here is the GetHtml code:

    private static void GetHtml(CompositeNode parentNode, StringBuilder sbHtml)
    {
        if (parentNode != null)
        {
            foreach (Aspose.Words.Node node in parentNode.ChildNodes)
            {
                if (node.NodeType == NodeType.StructuredDocumentTag)
                {
                    var structuredDocumentTag = (StructuredDocumentTag)node;

                    bool containsSubStructuredDocumentTag = false;
                    foreach (Aspose.Words.Node cNode in structuredDocumentTag.ChildNodes)
                    {
                        if (cNode.NodeType == NodeType.StructuredDocumentTag)
                        {
                            containsSubStructuredDocumentTag = true;
                            break;
                        }
                    }

                    if (containsSubStructuredDocumentTag)
                    {
                        if (node is CompositeNode)
                        {
                            if (((CompositeNode)node).ChildNodes != null
                                && ((CompositeNode)node).ChildNodes.Count > 0)
                            {
                                GetHtml(node as CompositeNode, sbHtml);
                            }
                        }
                    }
                    else
                    {
                        string value = null;

                        if ((structuredDocumentTag.IsShowingPlaceholderText == false)
                            || (structuredDocumentTag.SdtType == SdtType.Date))
                        {
                            switch (structuredDocumentTag.SdtType)
                            {
                                case SdtType.Checkbox:
                                    value = structuredDocumentTag.Checked.ToString();
                                    break;
                                case SdtType.Date:
                                    if (structuredDocumentTag.FullDate != DateTime.MinValue)
                                    {
                                        try
                                        {
                                            value = Convert.ToDouble(structuredDocumentTag.FullDate.Ticks).ToString();
                                        }
                                        catch (Exception ex)
                                        {
                                            value = "0";
                                        }
                                    }
                                    else
                                    {
                                        value = "0";
                                    }
                                    break;
                                case SdtType.DropDownList:
                                    {
                                        // SelectedValue is null when nothing is selected, so guard against it.
                                        value = structuredDocumentTag.ListItems.SelectedValue?.Value;
                                    }
                                    break;
                                case SdtType.PlainText:
                                case SdtType.RichText:
                                    {
                                        if (structuredDocumentTag.SdtType == SdtType.RichText)
                                        {
                                            value = ReadAllNodesFromField(structuredDocumentTag, SaveFormat.Html);
                                        }
                                        else
                                        {
                                            value = ReadAllNodesFromField(structuredDocumentTag, SaveFormat.Text);
                                        }
                                    }
                                    break;
                                default:
                                    value = null;
                                    break;
                            }

                        }

                        if (!string.IsNullOrEmpty(value))
                        {
                            sbHtml.AppendLine(value);
                        }
                    }
                }
                else
                {
                    if (node is CompositeNode)
                    {
                        if (((CompositeNode)node).ChildNodes != null
                            && ((CompositeNode)node).ChildNodes.Count > 0)
                        {
                            GetHtml(node as CompositeNode, sbHtml);
                        }
                    }
                }
            }
        }

    }

    private static string ReadAllNodesFromField(StructuredDocumentTag structuredDocumentTag, SaveFormat format)
    {
        string text = string.Empty;

        if (format == SaveFormat.Html)
        {
            var saveOptions = new HtmlSaveOptions
            {
                HtmlVersion = Aspose.Words.Saving.HtmlVersion.Html5,
                ExportImagesAsBase64 = true,
                ExportHeadersFootersMode = Aspose.Words.Saving.ExportHeadersFootersMode.None,
                ExportListLabels = Aspose.Words.Saving.ExportListLabels.AsInlineText
            };

            foreach (Aspose.Words.Node textNode in structuredDocumentTag.ChildNodes)
            {
                text += textNode.ToString(saveOptions);
            }
        }
        else
        {
            foreach (Aspose.Words.Node textNode in structuredDocumentTag.ChildNodes)
            {
                text += textNode.ToString(format);
            }
        }

        return text;
    }

The original Word file (26.docx) is attached.
26.zip (23.1 KB)

As you can see, the original file (26.docx) keeps its numbering formatting, but the new one (26_2.docx) loses it after being converted to HTML and back.

Thanks!


#4

@omri-1

Thanks for sharing the details. Please set HtmlSaveOptions.ExportListLabels to ExportListLabels.ByHtmlTags in your code to export all list labels as native HTML elements.

var saveOptions = new Aspose.Words.Saving.HtmlSaveOptions
{
    HtmlVersion = Aspose.Words.Saving.HtmlVersion.Html5,
    ExportImagesAsBase64 = true,
    ExportHeadersFootersMode = Aspose.Words.Saving.ExportHeadersFootersMode.None,
    ExportListLabels = Aspose.Words.Saving.ExportListLabels.ByHtmlTags
};

Please note that the DOCX and HTML formats are quite different, so it is sometimes hard to achieve 100% fidelity. If you want to preserve the same list numbers after conversion, please do not export each node to HTML separately. Please use the following code example instead.

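// Save returns save-output parameters rather than a Document, so the result is not assigned to a variable named "doc".
new Aspose.Words.Document(MyDir + @"26.docx").Save(MyDir + @"26.htm");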
new Aspose.Words.Document(MyDir + @"26.htm").Save(MyDir + @"26_2.docx", SaveFormat.Docx);

#5

Hi,
We have to export each node separately, since each node represents a different document in our system.

Changing ExportListLabels from “AsInlineText” to “ByHtmlTags” solves this problem but reintroduces an old one:


We set ExportListLabels to “AsInlineText” on your recommendation, to solve another problem.

Please provide an option that solves both problems.

Thanks!


#6

@omri-1

Thanks for your inquiry. Aspose.Words exports the nodes to HTML correctly. The input document has nine list items in the content control. Could you please share your expected output HTML for each node here for our reference? We will then provide you more information about your query.


#7

Sure. When using “ByHtmlTags”, the numbered list is converted to ol and li tags.
In HTML, these tags do not keep the numbering format that the list has in MS Word.

By adding the following CSS class to the ol elements, you can achieve the same numbering format in HTML as in MS Word:

<style>
    .my_ol {
        counter-reset: item;
    }

    .my_ol li {
        display: block;
    }

    .my_ol li:before {
        content: counters(item, ".") " ";
        counter-increment: item;
    }
</style>

Attached example (original and fixed HTML):
26_all_fixed.zip (1007 Bytes)

This solves the Word-to-HTML problem (assuming you add this CSS to your product in the next version).

However, it does not solve the HTML-to-Word problem.
When the HTML with the suggested CSS is converted back to Word, it almost works, but not 100%. The only remaining problem (after applying the CSS) is that sub-numbering always starts with 1:

  1. aaa
    1.1 aaa
  2. aaa
    1.1 aaa [THIS IS THE PROBLEM, SHOULD BE 2.1]

You can see it by converting the example above.

new Aspose.Words.Document(@"E:\WordTest\26_all_fixed.htm")
                            .Save(@"E:\WordTest\26_all_fixed_2.docx", SaveFormat.Docx);
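
To check the labels without opening the file in Word, something like the following could be used. It is only a rough sketch: the class name ListLabelDump is made up for illustration, and the path is the output file from the conversion above. It asks Aspose.Words to compute the list labels and prints the label of every list paragraph next to its text.

```csharp
using System;
using Aspose.Words;

class ListLabelDump
{
    static void Main()
    {
        // Output of the HTML-to-Word conversion shown above.
        var doc = new Document(@"E:\WordTest\26_all_fixed_2.docx");

        // List labels are only available after they have been computed.
        doc.UpdateListLabels();

        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            if (para.ListFormat.IsListItem)
            {
                // LabelString is the rendered label, for example "1." or "2.1".
                Console.WriteLine("{0}\t{1}", para.ListLabel.LabelString, para.GetText().Trim());
            }
        }
    }
}
```

Running the same dump against the original 26.docx and comparing the two outputs should show exactly which sub-items restart at 1.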

This we leave to your experts to solve :)

Thanks!


#8

@omri-1

This is the expected behavior of Aspose.Words, not a bug. You can verify it by manually creating the HTML for the list items and joining all the HTML fragments into a single document.


#9

I’m not sure I understand your answer.

As far as we understand, if we convert Word to HTML or HTML to Word and the result does not look or behave the same, that is a bug.

This case contains two bugs:

1 - When we convert Word to HTML, it does not look the same. We found the fix for you (see the CSS in my last post).

2 - When we convert HTML to Word, it does not look the same. We found a partial fix for you (see the CSS in my last post).


#10

@omri-1

We have converted the DOCX to HTML and the HTML back to DOCX using the latest version, Aspose.Words for .NET 19.3, with the following code example. We have not found any issue with the output DOCX. We have attached the output DOCX and HTML to this post for your reference.

Docs.zip (11.3 KB)

We suggest that you use Aspose.Words for .NET 19.3.

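// Save returns save-output parameters rather than a Document, so the result is not assigned to a variable named "doc".
new Aspose.Words.Document(MyDir + @"26.docx").Save(MyDir + @"26.htm");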
new Aspose.Words.Document(MyDir + @"26.htm").Save(MyDir + @"19.3.docx", SaveFormat.Docx);

#11

@tahir.manzoor as I wrote in my previous posts, I can’t use the Document.Save method, since we need only parts of the document (only the parts that are inside a StructuredDocumentTag).

If Document.Save works, then there is a bug in Node.ToString, since it should give the same output. Can you share the code of the Save method, please? Or can you find the difference between Document.Save and Node.ToString?

Thanks


#12

@omri-1

We have logged your requirement in our issue tracking system as WORDSNET-18318. We will check the possibility of implementation of this feature and update you via this forum thread.


#13

Thank you.
This is not a feature but a bug (the output of Document.Save and the output of Node.ToString must be the same).
Please update us when it is fixed.


#14

@omri-1

Please note that Node.ToString does not export list styles. This method exports the HTML correctly.

The Document.Save method and Node.ToString method are two different methods and do not have the same functionality.

Please consider the following list items. You are exporting each list item as separate HTML content. When you export a single node to HTML, Node.ToString does not preserve any information about the other list items.

1.1 List item 1
1.2 List item 2

Could you please share your expected HTML output (two separate HTML fragments) for the two list items above? We will then update WORDSNET-18318 in our issue tracking system according to your requirement.


#15

@tahir.manzoor
You are right that if we export just one item of the list, it does not make sense to keep the formatting.
But this is not the case here.

If we export a node (Paragraph, StructuredDocumentTag, etc.) that contains the entire list, there should be no difference between this export and saving the entire document.
The fact that Node.ToString does not export list styles is the bug.


#16

@omri-1

Thanks for sharing the detail. We will inform you via this forum thread once this issue is resolved.


#17

Back to this workaround: Numbering formatting is ignored when converting HTML to Word
This is also not good, since your Word-to-HTML output includes a “start” attribute (<ol start="3" …) that does not work with the CSS counter.
So please expand WORDSNET-18318 to include both import and export…


#18

@omri-1

Could you please share some more detail about your query? We will then log your requirement in our issue tracking system accordingly.


#19

I’ll be happy to help, what more info do you need?


#20

@omri-1

No further detail is needed. We have logged your issue as WORDSNET-18318, to export correct styles for list items using the Node.ToString method.
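
Until that issue is resolved, one workaround that may be worth trying is to route each content control through Document.Save instead of Node.ToString, since Document.Save was reported above to keep the numbering. The following is only an untested sketch under several assumptions: SdtHtmlExport and ExportViaTemporaryDocument are hypothetical names (not Aspose.Words APIs), the content control is block-level, the save options use the default UTF-8 encoding, and images are exported as Base64 so that saving to a stream needs no image folder. Whether the result matches a full-document save for your files would need to be checked.

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Markup;
using Aspose.Words.Saving;

public static class SdtHtmlExport
{
    // Hypothetical helper, not part of Aspose.Words: copies one content control
    // into an empty temporary document and saves that document as HTML.
    public static string ExportViaTemporaryDocument(StructuredDocumentTag sdt, HtmlSaveOptions saveOptions)
    {
        // Assumption: only block-level content controls can be appended to a Body.
        if (sdt.Level != MarkupLevel.Block)
            throw new NotSupportedException("This sketch only handles block-level content controls.");

        // A new Document starts with one section holding one empty paragraph; remove it.
        var tempDoc = new Document();
        tempDoc.FirstSection.Body.RemoveAllChildren();

        // NodeImporter brings the styles and list definitions used by the node along with it.
        var importer = new NodeImporter(sdt.Document, tempDoc, ImportFormatMode.KeepSourceFormatting);
        Node imported = importer.ImportNode(sdt, true);
        tempDoc.FirstSection.Body.AppendChild(imported);

        using (var stream = new MemoryStream())
        {
            tempDoc.Save(stream, saveOptions);
            stream.Position = 0;

            // StreamReader assumes UTF-8 unless a byte order mark says otherwise.
            using (var reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

In the GetHtml code from post #3, this helper would take the place of ReadAllNodesFromField for the SdtType.RichText case. Note that it returns a complete HTML document rather than a fragment, so the caller may need to extract the body content before storing it.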