Clone Document, Convert Clone .ToString(SaveFormat.Html), Convert HTML back to Document, and Compare Changes

I am having some trouble with a requirement. I need to make changes to a clone of a document while it is in HTML string format (or possibly XML, or any other format that lets me keep the formatting). After the changes are complete, I would like to convert that HTML back into a Document and compare it with the original. I will share some code snippets of what I’ve tried below. Keep in mind that this doesn’t include all of the logic, such as making changes to the HTML; I’m simply trying to get the HTML string and then convert it back into a document that is exactly the same as the original.

byte[] htmlBytes;
using (var htmlStream = new MemoryStream())
{
    var htmlSaveOptions = new HtmlSaveOptions(SaveFormat.Html)
    {
        ExportImagesAsBase64 = true,
        ExportFontsAsBase64 = true
    };

    originalDoc.Save(htmlStream, htmlSaveOptions);
    htmlBytes = htmlStream.ToArray();
}


Document htmlDoc;
using (var htmlInputStream = new MemoryStream(htmlBytes))
{
    htmlDoc = new Document(htmlInputStream);
}

byte[] modifiedDocumentData;
using (var docxStream = new MemoryStream())
{
    htmlDoc.Save(docxStream, SaveFormat.Docx);
    modifiedDocumentData = docxStream.ToArray();
}

var originalDoc = ProcessingFunctionHelpers.LoadDocumentAsync(input.OriginalDocumentData, logger);
var processedDoc = ProcessingFunctionHelpers.LoadDocumentAsync(input.ProcessedDocumentData, logger);

originalDoc.Compare(processedDoc, Constants.REVISION_AUTHOR, DateTime.Now, new CompareOptions()
{
    Granularity = Granularity.WordLevel,
    IgnoreFootnotes = true,
    IgnoreHeadersAndFooters = true
});


public static Document LoadDocumentAsync(byte[] stream, ILogger logger)
{
    try
    {
        using (var memStream = new MemoryStream(stream))
        {
            return new Document(memStream);
        }
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Failed to load the document.");
        throw;
    }
}

Sherlock Holmes (Test).docx (130.3 KB)

@chrisbluemorgan

Overview

You can certainly:

  1. Clone a Document.
  2. Save the clone to an HTML string (or any other intermediate format).
  3. Edit that HTML string.
  4. Load the edited HTML back into a Document.
  5. Compare the edited document with the original.

The code you posted is close, but a few details are missing that affect round‑tripping and the Compare result.

Below is a complete, step‑by‑step example that shows the correct usage of the Aspose.Words API together with the options that help preserve formatting when converting to/from HTML.


1. Clone the original document

// originalDoc is the source Document you already have
Document clonedDoc = (Document)originalDoc.Clone();   // deep clone

2. Save the clone to an HTML string

string htmlString;
using (var htmlStream = new MemoryStream())
{
    var htmlSaveOptions = new HtmlSaveOptions(SaveFormat.Html)
    {
        ExportImagesAsBase64 = true,          // embed images → no external files
        ExportFontsAsBase64  = true,          // embed fonts
        ExportHeadersFootersMode = ExportHeadersFootersMode.PerSection,   // keep headers/footers
        ExportRoundtripInformation = true,    // write extra Aspose-specific information used when loading the HTML back
        CssStyleSheetType = CssStyleSheetType.Inline   // keep styling inline (easier to edit)
    };

    clonedDoc.Save(htmlStream, htmlSaveOptions);
    htmlString = Encoding.UTF8.GetString(htmlStream.ToArray());
}

Why these options?

  • ExportHeadersFootersMode and ExportRoundtripInformation help the HTML carry the logical structures that exist in a DOCX file, so more of them survive the load step. Check the property names against the Aspose.Words version you are using.
  • Using inline CSS makes the HTML easier to manipulate manually without losing style information.

3. Edit the HTML string

You can now perform any text replacement, DOM manipulation, or other HTML edits you need.
(No Aspose code is required for this step; just string/HTML processing in your preferred way.)

// Example: replace a placeholder with real content
htmlString = htmlString.Replace("{PLACEHOLDER}", "New text");
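One caveat with plain string replacement is that the inserted value becomes part of the markup, so characters such as & or < should be HTML-encoded first. The following is a minimal sketch, assuming the placeholder sits in ordinary text content. The helper name HtmlPlaceholder.Fill is hypothetical; it relies only on System.Net.WebUtility from the .NET base library.

```csharp
using System;
using System.Net;

public static class HtmlPlaceholder
{
    // Hypothetical helper: encodes the value so it is treated as text, not markup,
    // and then substitutes it for every occurrence of the placeholder.
    public static string Fill(string html, string placeholder, string value)
    {
        string encoded = WebUtility.HtmlEncode(value);
        return html.Replace(placeholder, encoded);
    }
}
```

It could be called as htmlString = HtmlPlaceholder.Fill(htmlString, "{PLACEHOLDER}", "Terms & Conditions"); in place of the direct Replace call.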

4. Load the edited HTML back into a Document

Document editedDoc;
using (var editedHtmlStream = new MemoryStream(Encoding.UTF8.GetBytes(htmlString)))
{
    var htmlLoadOptions = new HtmlLoadOptions
    {
        LoadFormat = LoadFormat.Html,     // state the format explicitly instead of relying on detection
        Encoding = Encoding.UTF8          // match the encoding used to build the stream
    };

    editedDoc = new Document(editedHtmlStream, htmlLoadOptions);
}

Important:

  • HtmlLoadOptions controls how the HTML is parsed on load. It cannot restore formatting that the HTML never captured, so round‑trip fidelity depends mostly on the options used at save time.

5. Compare the edited document with the original

// You can compare the original (unchanged) doc with the edited one.
// Both documents must be free of tracked changes before Compare is called.
originalDoc.Compare(
    editedDoc,
    "RevisionAuthor",
    DateTime.Now,
    new CompareOptions
    {
        Granularity = Granularity.WordLevel,
        IgnoreFootnotes = true,
        IgnoreHeadersAndFooters = true   // optional, set according to your needs
    });

// The revisions are now stored inside originalDoc; you can inspect them, save, etc.
originalDoc.Save("ComparedResult.docx");
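To see what Compare actually flagged, the revisions can be listed before saving. This is only a sketch: it assumes the Aspose.Words package is referenced, and RevisionReport is a hypothetical helper name, not part of the library.

```csharp
using System;
using Aspose.Words;

public static class RevisionReport
{
    // Hypothetical helper: prints a one-line summary for each revision
    // that Compare recorded in the document.
    public static void Print(Document comparedDoc)
    {
        Console.WriteLine($"Revisions found: {comparedDoc.Revisions.Count}");
        foreach (Revision revision in comparedDoc.Revisions)
        {
            Console.WriteLine($"{revision.RevisionType} by {revision.Author} at {revision.DateTime}");
        }
    }
}
```

Calling RevisionReport.Print(originalDoc) right after Compare, once on an unedited round trip and once on the edited one, gives a rough idea of how many revisions come from the conversion itself rather than from your edits.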

6. Full Sample (All Steps Together)

// ------------------------------------------------------------
// 1. Load the original document (from a byte[] in your case)
// ------------------------------------------------------------
Document originalDoc = new Document(new MemoryStream(originalBytes));

// ------------------------------------------------------------
// 2. Clone it
// ------------------------------------------------------------
Document clonedDoc = (Document)originalDoc.Clone();

// ------------------------------------------------------------
// 3. Save clone to HTML string
// ------------------------------------------------------------
string htmlString;
using (var ms = new MemoryStream())
{
    var htmlSave = new HtmlSaveOptions(SaveFormat.Html)
    {
        ExportImagesAsBase64 = true,
        ExportFontsAsBase64   = true,
        ExportHeadersFootersMode   = ExportHeadersFootersMode.PerSection,
        ExportRoundtripInformation = true,
        CssStyleSheetType     = CssStyleSheetType.Inline
    };
    clonedDoc.Save(ms, htmlSave);
    htmlString = Encoding.UTF8.GetString(ms.ToArray());
}

// ------------------------------------------------------------
// 4. ----- MODIFY THE HTML STRING HERE -----
// ------------------------------------------------------------
htmlString = htmlString.Replace("{PLACEHOLDER}", "New content");

// ------------------------------------------------------------
// 5. Load the edited HTML back into a Document
// ------------------------------------------------------------
Document editedDoc;
using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(htmlString)))
{
    var htmlLoad = new HtmlLoadOptions
    {
        LoadFormat = LoadFormat.Html,
        Encoding = Encoding.UTF8
    };
    editedDoc = new Document(ms, htmlLoad);
}

// ------------------------------------------------------------
// 6. Compare with the original
// ------------------------------------------------------------
originalDoc.Compare(
    editedDoc,
    "RevisionAuthor",
    DateTime.Now,
    new CompareOptions
    {
        Granularity = Granularity.WordLevel,
        IgnoreFootnotes = true,
        IgnoreHeadersAndFooters = true
    });

originalDoc.Save("Result_With_Revisions.docx");

Why the Compare Result May Appear “Wrong”

When a DOCX → HTML → DOCX round‑trip is performed, some Word‑specific constructs cannot be perfectly represented in HTML (e.g., complex numbering, certain VML drawings, hidden text, some field codes, revision marks). Consequently the loaded HTML document may differ slightly from the original, and the Compare operation will flag those differences.

Tips to Minimize False Differences

Aspect | Recommendation
Headers / Footers | Keep ExportHeadersFootersMode at a value other than None when saving, and set IgnoreHeadersAndFooters = true (or false if you need them) when comparing.
Footnotes / Endnotes | Set IgnoreFootnotes = false when comparing if footnotes are important.
Images & Fonts | Keep them as Base64 (ExportImagesAsBase64, ExportFontsAsBase64).
Numbering & Styles | Keep CssStyleSheetType.Inline so formatting stays on each element, and leave ExportRoundtripInformation enabled.
Hidden/Deleted Text | Accept or reject tracked changes (for example with AcceptAllRevisions) before saving to HTML; Compare requires both documents to be free of revisions.
Fields | Some fields (e.g., REF, TOC) become static text in HTML. If you need them as fields, consider editing the DOCX directly rather than via HTML.

If after applying the options you still see many unwanted revisions, the most reliable way to make programmatic changes is to edit the Document object directly (using DocumentBuilder, node manipulation, etc.) rather than via HTML. This avoids any round‑trip loss.
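A rough sketch of that direct approach, assuming a simple placeholder replacement is the kind of change you need. DirectEdit.ReplaceAndCompare is a hypothetical helper name, and the snippet assumes the Aspose.Words package is referenced.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Comparing;
using Aspose.Words.Replacing;

public static class DirectEdit
{
    // Hypothetical helper: edits a clone in memory (no HTML involved)
    // and returns a copy of the original that carries the differences as revisions.
    public static Document ReplaceAndCompare(Document originalDoc, string placeholder, string value)
    {
        // Work on a deep clone so the original stays untouched.
        Document editedDoc = (Document)originalDoc.Clone();
        editedDoc.Range.Replace(placeholder, value, new FindReplaceOptions());

        // Compare writes revisions into the document it is called on,
        // so run it on a second clone rather than on the original.
        Document result = (Document)originalDoc.Clone();
        result.Compare(editedDoc, "RevisionAuthor", DateTime.Now, new CompareOptions());
        return result;
    }
}
```

Because nothing is converted to another format, the only revisions reported should be the ones caused by the replacement itself.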


TL;DR

  1. Clone the document.
  2. Save the clone to HTML with all “preserve” options turned on.
  3. Edit the HTML string.
  4. Load the edited HTML using HtmlLoadOptions.
  5. Run originalDoc.Compare(editedDoc, …).

If you need pixel‑perfect round‑tripping, edit the Document object directly rather than using HTML, because certain Word‑only features cannot be fully expressed in HTML.

Feel free to ask if you need a concrete example for a specific type of modification (e.g., replacing a placeholder, inserting a table, etc.). Happy coding!

@chrisbluemorgan Please note that Aspose.Words is designed to work with MS Word documents. The HTML and MS Word document object models are quite different, and it is not always possible to achieve 100% fidelity when converting from one format to the other. So after a DOCX->HTML->DOCX round trip, comparing the resulting document with the original DOCX may reveal differences that result from losses during conversion.

If your goal is to convert the document to a string and then convert the string back without data or formatting losses, consider using the FlatOpc (MS Word 2007 XML) format instead of HTML:

Document doc = new Document(@"C:\Temp\in.docx");

// Convert document to Word 2007 XML
string xml;
using (MemoryStream ms = new MemoryStream())
{
    doc.Save(ms, SaveFormat.FlatOpc);
    xml = Encoding.UTF8.GetString(ms.ToArray());
}

// Convert Word 2007 XML string to DOCX
using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
{
    Document doc1 = new Document(ms);
    doc1.Save(@"C:\Temp\out.docx");
}
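Since the original requirement was to change the document while it is in string form, here is a hedged sketch of editing the FlatOpc XML between the two conversions above. It uses only System.Xml.Linq and assumes the text to change sits inside a single w:t element. FlatOpcEditor.ReplaceText is a hypothetical helper, not an Aspose.Words API.

```csharp
using System;
using System.Xml.Linq;

public static class FlatOpcEditor
{
    // WordprocessingML main namespace; <w:t> elements hold the text of each run.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: replaces text inside every <w:t> element of a Flat OPC string.
    public static string ReplaceText(string flatOpcXml, string search, string replacement)
    {
        // Drop a leading byte order mark that would otherwise trip up XDocument.Parse.
        string cleaned = flatOpcXml.TrimStart('\uFEFF');
        XDocument xml = XDocument.Parse(cleaned, System.Xml.Linq.LoadOptions.PreserveWhitespace);

        foreach (XElement text in xml.Descendants(W + "t"))
        {
            if (text.Value.Contains(search))
            {
                // Assigning Value lets LINQ to XML escape characters such as & and <.
                text.Value = text.Value.Replace(search, replacement);
            }
        }

        // XDocument.ToString() omits the XML declaration, so add it back when present.
        string declaration = xml.Declaration == null ? "" : xml.Declaration.ToString();
        return declaration + xml.ToString(System.Xml.Linq.SaveOptions.DisableFormatting);
    }
}
```

It would be used as xml = FlatOpcEditor.ReplaceText(xml, "{PLACEHOLDER}", "New text"); before loading the string back. Word often splits visible text across several runs, so a placeholder that spans more than one w:t element will not be matched by this sketch.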

Hi Alexey,

Using the FlatOPC Format works way better than HTML for my purposes. Thanks so much!
