DOCX to HTML and back to DOCX loses formatting

Hi team, I am working on a project that allows users to upload a DOCX file. We convert it to HTML, do some processing, and convert it back to DOCX, but sometimes the formatting of the newly generated DOCX differs from the original. I am using Aspose.Words, NuGet version 23.10.0, and I am a paid user at Joblogic; I can provide the license details if needed.
Below is a sample of my code, which is pretty simple: it reads the DOCX into a Document, converts it to an HTML string, and saves it back to DOCX. The attachment contains the files I used for testing. Could you take a look and let me know what I did wrong, or which settings I can use to improve the result?
Tests.zip (211.5 KB)

using System;
using System.IO;
using Aspose.Words;
using Microsoft.AspNetCore.Mvc;

// GenerateDocxTemplateRequest is defined elsewhere; its Input property exposes OpenReadStream().
public class DocxTemplateController : ControllerBase
{
    [HttpPost(nameof(Generate))]
    public ActionResult Generate([FromForm] GenerateDocxTemplateRequest request)
    {
        var document = GetDocxDocument(request);

        // Serialize the whole document to a flow-HTML string using default save options.
        var documentText = document.ToString(SaveFormat.Html);

        return DownloadDocx(documentText);
    }

    // Loads the uploaded DOCX into an Aspose.Words Document.
    private static Document GetDocxDocument(GenerateDocxTemplateRequest request)
    {
        using (var stream = request.Input.OpenReadStream())
        {
            return new Document(stream);
        }
    }

    // Builds a new document from the HTML string and returns it as a DOCX download.
    private ActionResult DownloadDocx(string htmlContent)
    {
        var document = new Document();

        var builder = new DocumentBuilder(document);
        builder.InsertHtml(htmlContent);

        // Not disposed here: the returned file result disposes the stream after the response is sent.
        var ms1 = new MemoryStream();
        document.Save(ms1, SaveFormat.Docx);
        ms1.Position = 0;
        return File(ms1, "application/vnd.openxmlformats-officedocument.wordprocessingml.document", $"DocxTemplate-{DateTime.UtcNow:yyyyMMdd-HHmm}.docx");
    }
}

@JamesNguyen Please note that Aspose.Words is designed to work with MS Word documents. The HTML and MS Word document object models are quite different, and it is not always possible to provide 100% fidelity when converting from one format to the other. In most cases Aspose.Words mimics MS Word behavior when working with HTML documents. Unfortunately, it is impossible to fully preserve the original document structure and layout after a DOCX->HTML->DOCX roundtrip.

Hi @alexey.noskov, thanks for your quick response. Is there any HtmlSaveOptions setting you would recommend to improve this, other than using the call below?

document.ToString(SaveFormat.Html)

Could you try it with the test files I attached earlier and see whether the result can be improved?
Thanks, James

I changed the code to use HtmlSaveOptions and the result improved a bit:

HtmlSaveOptions saveOptions = new HtmlSaveOptions();
saveOptions.ExportRoundtripInformation = true;
saveOptions.ExportImagesAsBase64 = true;
saveOptions.ExportPageMargins = true;
saveOptions.AllowNegativeIndent = true;
saveOptions.ExportPageSetup = true;
saveOptions.ExportDocumentProperties = true;
saveOptions.ExportHeadersFootersMode = ExportHeadersFootersMode.PerSection;
var documentText = document.ToString(saveOptions);

However, some elements overlap, as shown in the attachment. Is there any setting I can use to overcome this?


For more information: the formatting is already broken after the conversion from DOCX to HTML. I see that the overlapping element is an image, where the original DOCX has a table or paragraph. Why does it behave like that, and what is the fix?
Thanks in advance,
James

@JamesNguyen I have tested your documents but unfortunately, as I mentioned, there is no way to provide 100% fidelity after a DOCX->HTML->DOCX roundtrip due to the differences between the HTML and DOCX object models. This is especially true for complex documents like yours.

Your documents use floating textboxes with tables inside. Such textboxes are exported to HTML as images in an attempt to preserve their original position. But again, due to the difference between the models, it is not possible to preserve such objects accurately in HTML.
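
To see which shapes in a document fall into this category, a sketch along the following lines may help. It only lists candidates: the assumption here is that any non-inline shape containing a table is likely to be affected. The class and method names (FloatingTextBoxReport, Print) are made up for illustration.

```csharp
using System;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Drawing;

// Hypothetical helper, not part of Aspose.Words.
public static class FloatingTextBoxReport
{
    // Prints every floating (non-inline) shape that contains at least one table.
    public static void Print(Document doc)
    {
        // Collect all shapes in the document, including those in headers and footers.
        var shapes = doc.GetChildNodes(NodeType.Shape, true).OfType<Shape>();

        foreach (Shape shape in shapes)
        {
            bool hasTable = shape.GetChildNodes(NodeType.Table, true).Count > 0;

            // Assumption: floating shapes with tables are the ones likely rendered as images.
            if (!shape.IsInline && hasTable)
            {
                Console.WriteLine($"{shape.ShapeType} '{shape.Name}' wrap={shape.WrapType}");
            }
        }
    }
}
```

Calling FloatingTextBoxReport.Print(document) before the HTML export gives a rough list of the elements worth inspecting in the output.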

In addition to the flow HTML format, Aspose.Words provides the HtmlFixed save format, which is designed to preserve the original document layout for viewing purposes. So if your goal is to display the HTML on a page, this format can be considered as an alternative. Unfortunately, it does not support roundtrip to DOCX at all.
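
For viewing only, saving to HtmlFixed could look roughly like the sketch below. The wrapper class, method name, and path parameters are placeholders, and embedding images, CSS, and fonts into a single file is just one possible configuration, not a recommendation for every case.

```csharp
using Aspose.Words;
using Aspose.Words.Saving;

// Hypothetical wrapper, shown only to illustrate the HtmlFixed save format.
public static class FixedHtmlExport
{
    // Saves the DOCX at docxPath as fixed-layout HTML at htmlPath.
    public static void Save(string docxPath, string htmlPath)
    {
        var doc = new Document(docxPath);

        var options = new HtmlFixedSaveOptions
        {
            // Embed resources so the output is a single self-contained file.
            ExportEmbeddedImages = true,
            ExportEmbeddedCss = true,
            ExportEmbeddedFonts = true
        };

        doc.Save(htmlPath, options);
    }
}
```

The resulting HTML is meant for display; as noted above, it cannot be loaded back and converted to DOCX.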