@chrisbluemorgan
Overview
You can certainly:
- Clone a Document.
- Save the clone to an HTML string (or any other intermediate format).
- Edit that HTML string.
- Load the edited HTML back into a Document.
- Compare the edited document with the original.
The code you posted is close, but a few details are missing that affect round‑tripping and the Compare result.
Below is a complete, step‑by‑step example that shows the correct usage of the Aspose.Words API together with the options that help preserve formatting when converting to/from HTML.
1. Clone the original document
// originalDoc is the source Document you already have
Document clonedDoc = (Document)originalDoc.Clone(); // deep clone
2. Save the clone to an HTML string
string htmlString;
using (var htmlStream = new MemoryStream())
{
    var htmlSaveOptions = new HtmlSaveOptions(SaveFormat.Html)
    {
        ExportImagesAsBase64 = true, // embed images → no external files
        ExportFontsAsBase64 = true, // embed fonts
        ExportHeadersFootersMode = ExportHeadersFootersMode.PerSection, // keep headers/footers
        CssStyleSheetType = CssStyleSheetType.Inline // keep styling inline (easier to edit)
    };
    clonedDoc.Save(htmlStream, htmlSaveOptions);
    htmlString = Encoding.UTF8.GetString(htmlStream.ToArray());
}
Why these options?
- ExportImagesAsBase64 and ExportFontsAsBase64 keep everything in a single HTML string, so no image or font files have to travel with it.
- ExportHeadersFootersMode writes header and footer content into the HTML instead of dropping it.
- Using inline CSS makes the HTML easier to manipulate manually without losing style information.
3. Edit the HTML string
You can now perform any text replacement, DOM manipulation, or other HTML edits you need.
(No Aspose code is required for this step; just string/HTML processing in your preferred way.)
// Example: replace a placeholder with real content
htmlString = htmlString.Replace("{PLACEHOLDER}", "New text");
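One detail worth guarding against in this step: if the replacement text contains characters such as & or <, a plain Replace can produce broken markup that then loads differently. Below is a minimal sketch that HTML-encodes the new text first. HtmlPlaceholder and Fill are hypothetical names used only for illustration; WebUtility.HtmlEncode comes from System.Net in the .NET base library.

```csharp
using System;
using System.Net;

public static class HtmlPlaceholder
{
    // Replaces every occurrence of `placeholder` in `html` with `plainText`.
    // The text is HTML-encoded first so that & and < cannot break the markup.
    public static string Fill(string html, string placeholder, string plainText)
    {
        string encoded = WebUtility.HtmlEncode(plainText);
        return html.Replace(placeholder, encoded);
    }
}
```

For example, HtmlPlaceholder.Fill("<p>{PLACEHOLDER}</p>", "{PLACEHOLDER}", "Fish & Chips <new>") returns "<p>Fish &amp; Chips &lt;new&gt;</p>". Note that a placeholder whose characters carry different formatting may be split across several elements in the exported HTML, in which case a plain string match will not find it.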
4. Load the edited HTML back into a Document
Document editedDoc;
using (var editedHtmlStream = new MemoryStream(Encoding.UTF8.GetBytes(htmlString)))
{
    var htmlLoadOptions = new HtmlLoadOptions
    {
        // State the format explicitly instead of relying on auto-detection
        LoadFormat = LoadFormat.Html
    };
    editedDoc = new Document(editedHtmlStream, htmlLoadOptions);
}
Important:
HtmlLoadOptions is the loading counterpart of HtmlSaveOptions, but it does not mirror the save options one for one. How much formatting survives the round trip depends mainly on what was written to the HTML when saving and on how carefully the HTML is edited.
5. Compare the edited document with the original
// You can compare the original (unchanged) doc with the edited one.
// Compare requires that neither document contains revisions beforehand.
originalDoc.Compare(
    editedDoc,
    "RevisionAuthor",
    DateTime.Now,
    new CompareOptions
    {
        Granularity = Granularity.WordLevel,
        IgnoreFootnotes = true,
        IgnoreHeadersAndFooters = true // optional – set according to your needs
    });
// The revisions are now stored inside originalDoc; you can inspect them, save, etc.
originalDoc.Save("ComparedResult.docx");
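To inspect the revisions rather than only saving them, something along these lines should work. It is a rough sketch, not tested against your document; RevisionReport and Print are hypothetical names, and it assumes the document passed in is the one Compare was called on.

```csharp
using System;
using Aspose.Words;

public static class RevisionReport
{
    // Prints one line per revision stored in the compared document.
    public static void Print(Document comparedDoc)
    {
        Console.WriteLine($"Revisions found: {comparedDoc.Revisions.Count}");
        foreach (Revision revision in comparedDoc.Revisions)
        {
            // Style definition changes have no parent node, so only read
            // the affected text for the other revision types.
            string text = revision.RevisionType == RevisionType.StyleDefinitionChange
                ? "(style change)"
                : revision.ParentNode.GetText().Trim();
            Console.WriteLine($"{revision.RevisionType} by {revision.Author}: {text}");
        }
    }
}
```

Calling RevisionReport.Print(originalDoc) right after Compare gives a quick view of which revisions come from your edit and which are side effects of the HTML conversion.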
6. Full Sample (All Steps Together)
// Namespaces as found in recent Aspose.Words for .NET releases
using System;
using System.IO;
using System.Text;
using Aspose.Words;
using Aspose.Words.Comparing;
using Aspose.Words.Loading;
using Aspose.Words.Saving;

// ------------------------------------------------------------
// 1. Load the original document (from a byte[] in your case)
// ------------------------------------------------------------
Document originalDoc = new Document(new MemoryStream(originalBytes));
// ------------------------------------------------------------
// 2. Clone it
// ------------------------------------------------------------
Document clonedDoc = (Document)originalDoc.Clone();
// ------------------------------------------------------------
// 3. Save clone to HTML string
// ------------------------------------------------------------
string htmlString;
using (var ms = new MemoryStream())
{
    var htmlSave = new HtmlSaveOptions(SaveFormat.Html)
    {
        ExportImagesAsBase64 = true,
        ExportFontsAsBase64 = true,
        ExportHeadersFootersMode = ExportHeadersFootersMode.PerSection,
        CssStyleSheetType = CssStyleSheetType.Inline
    };
    clonedDoc.Save(ms, htmlSave);
    htmlString = Encoding.UTF8.GetString(ms.ToArray());
}
// ------------------------------------------------------------
// 4. ----- MODIFY THE HTML STRING HERE -----
// ------------------------------------------------------------
htmlString = htmlString.Replace("{PLACEHOLDER}", "New content");
// ------------------------------------------------------------
// 5. Load the edited HTML back into a Document
// ------------------------------------------------------------
Document editedDoc;
using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(htmlString)))
{
    var htmlLoad = new HtmlLoadOptions
    {
        LoadFormat = LoadFormat.Html
    };
    editedDoc = new Document(ms, htmlLoad);
}
// ------------------------------------------------------------
// 6. Compare with the original
//    (neither document may contain revisions before this call)
// ------------------------------------------------------------
originalDoc.Compare(
    editedDoc,
    "RevisionAuthor",
    DateTime.Now,
    new CompareOptions
    {
        Granularity = Granularity.WordLevel,
        IgnoreFootnotes = true,
        IgnoreHeadersAndFooters = true
    });
originalDoc.Save("Result_With_Revisions.docx");
Why the Compare Result May Appear “Wrong”
When a DOCX → HTML → DOCX round‑trip is performed, some Word‑specific constructs cannot be perfectly represented in HTML (e.g., complex numbering, certain VML drawings, hidden text, some field codes, revision marks). Consequently the loaded HTML document may differ slightly from the original, and the Compare operation will flag those differences.
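One way to see how much of the Compare output is caused by the round trip itself is to run the conversion with no edits at all and count the revisions. The sketch below does that; RoundTripBaseline and CountNoiseRevisions are hypothetical names, and it assumes the input document contains no revisions of its own (Compare would otherwise throw).

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Comparing;
using Aspose.Words.Saving;

public static class RoundTripBaseline
{
    // Returns how many revisions the HTML round trip alone introduces,
    // i.e. with no edits made to the HTML in between.
    public static int CountNoiseRevisions(Document originalDoc)
    {
        // Work on copies so the caller's document is left untouched.
        Document reference = (Document)originalDoc.Clone();
        Document toRoundTrip = (Document)originalDoc.Clone();

        Document roundTripped;
        using (var stream = new MemoryStream())
        {
            toRoundTrip.Save(stream, new HtmlSaveOptions(SaveFormat.Html)
            {
                ExportImagesAsBase64 = true,
                ExportRoundtripInformation = true, // set explicitly for clarity
                CssStyleSheetType = CssStyleSheetType.Inline
            });
            stream.Position = 0; // rewind before loading
            roundTripped = new Document(stream); // format is auto-detected
        }

        reference.Compare(roundTripped, "Baseline", DateTime.Now, new CompareOptions());
        return reference.Revisions.Count;
    }
}
```

If this returns a number well above zero for your document, the extra revisions you see after editing are largely conversion noise rather than a mistake in your Compare call.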
Tips to Minimize False Differences
| Aspect | Recommendation |
|---|---|
| Headers / Footers | Export them with ExportHeadersFootersMode when saving, and set IgnoreHeadersAndFooters = true when comparing (or false if header/footer changes matter to you). |
| Footnotes / Endnotes | Set IgnoreFootnotes = false when comparing if footnotes are important; otherwise leave it true. |
| Images & Fonts | Keep them as Base64 (ExportImagesAsBase64, ExportFontsAsBase64). |
| Numbering & Styles | Keep CssStyleSheetType.Inline so that the styling travels with each element you edit. |
| Hidden/Deleted Text | Hidden text is one of the constructs that may not survive the HTML round trip, so expect revisions around it. |
| Fields | Some fields (e.g., REF, TOC) become static text in HTML. If you need them as fields, consider editing the DOCX directly rather than via HTML. |
If after applying the options you still see many unwanted revisions, the most reliable way to make programmatic changes is to edit the Document object directly (using DocumentBuilder, node manipulation, etc.) rather than via HTML. This avoids any round‑trip loss.
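As a rough illustration of that direct approach, the sketch below replaces a placeholder with Range.Replace and compares the result with an untouched copy. DirectEdit and ReplaceAndCompare are hypothetical names, and the input document is assumed to contain no revisions before the call.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Comparing;
using Aspose.Words.Replacing;

public static class DirectEdit
{
    // Edits a clone of the document in place (no HTML involved) and returns
    // a second clone that carries the differences as revisions.
    public static Document ReplaceAndCompare(Document originalDoc, string placeholder, string newText)
    {
        Document result = (Document)originalDoc.Clone();
        Document editedDoc = (Document)originalDoc.Clone();

        // Plain-text find and replace over the whole document.
        editedDoc.Range.Replace(placeholder, newText, new FindReplaceOptions());

        result.Compare(editedDoc, "RevisionAuthor", DateTime.Now, new CompareOptions());
        return result;
    }
}
```

Because both documents come from the same clone and never pass through HTML, the only revisions you should get are the ones caused by the replacement.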
TL;DR
- Clone the document.
- Save the clone to HTML with all “preserve” options turned on.
- Edit the HTML string.
- Load the edited HTML using HtmlLoadOptions.
- Run originalDoc.Compare(editedDoc, …).
If you need pixel‑perfect round‑tripping, edit the Document object directly rather than using HTML, because certain Word‑only features cannot be fully expressed in HTML.
Feel free to ask if you need a concrete example for a specific type of modification (e.g., replacing a placeholder, inserting a table, etc.). Happy coding!