Compare Word and "HTML output from Word"

Hello,

I would like to compare a Word document with the “HTML output from Word”. I have tried several approaches, but I can’t get the two to compare as equal.

I’m not so much interested in formatting as in text, tables, numbering, etc.

I tried converting both formats to Markdown, but the Word and HTML conversions produce different Markdown output that cannot be compared directly. (Maybe it would work, but I would have to replace a lot.)

Could you direct me to how I could solve this problem?

TB

@benestom

Can you please provide more details on the specific methods or code you have tried for comparing Word and HTML outputs?

For HTML:

 // HTML
 using (HTMLDocument document = new HTMLDocument(htmlPath))
 {
     // Extract the HTML content
     string extractedHtml = document.DocumentElement.OuterHTML;

     // Convert HTML → Markdown
     var converter = new Converter();
     string markdownString = converter.Convert(extractedHtml);

     // Save the Markdown to a file
     File.WriteAllText(Path.Combine(outputFolder, "outputHtml.md"), markdownString);
 }

For Word:

Document doc = new Document(docxPath);
doc.Range.Bookmarks.Clear();
MarkdownSaveOptions options = new MarkdownSaveOptions
{
    ImagesFolder = "images",  // Save images to a folder
    TableContentAlignment = TableContentAlignment.Left, // Align tables as in the HTML
    ListExportMode = MarkdownListExportMode.MarkdownSyntax, // Use standard Markdown syntax for lists
    ParagraphBreak = "  \n",  // Paragraph formatting for better readability
    ExportHeadersFootersMode = TxtExportHeadersFootersMode.None
};
doc.Save(Path.Combine(outputFolder, "outputWord.md"), options);

0TS215pV02.docx (2.6 MB)

Save this document as HTML and compare:
0TS215pV02.docx (2.65 MB)

@benestom I am afraid it is technically impossible. Note that Aspose.Words is designed to work with MS Word documents. The HTML and MS Word document object models are quite different, and it is not always possible to achieve 100% fidelity when converting one model to the other. So there may be fidelity losses when converting an MS Word document to HTML, and additional losses when converting the HTML back to MS Word or to the Aspose.Words DOM.

Thank you, I think so too.

But can’t you think of a way to compare based only on the text?

Thanks in advance

@benestom If the source document was converted to HTML using Aspose.Words and then modified, you can try converting the original document to HTML as well for comparison. In that scenario both documents are converted to HTML by the same engine, so the differences will be minimal. You can then convert both HTML files to TXT or Markdown using the same tool and compare the results.
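
Once both sides have gone through the same tool, the comparison step itself could look roughly like the sketch below. It uses only the .NET base class library. `TextCompare`, `NormalizeLines` and `FirstDifference` are hypothetical names made up for this illustration and are not part of Aspose.Words or Aspose.HTML. The normalization rules (collapse whitespace, drop empty lines) are an assumption about which differences are safe to ignore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not an Aspose API.
public static class TextCompare
{
    // Splits text into lines, collapses whitespace runs to one space,
    // trims each line, and drops lines that end up empty.
    public static List<string> NormalizeLines(string text)
    {
        return text
            .Replace('\u00A0', ' ') // treat non-breaking spaces as plain spaces
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None)
            .Select(line => Regex.Replace(line, @"\s+", " ").Trim())
            .Where(line => line.Length > 0)
            .ToList();
    }

    // Returns the zero-based index of the first normalized line that differs,
    // or -1 when both texts normalize to the same sequence of lines.
    public static int FirstDifference(string left, string right)
    {
        List<string> a = NormalizeLines(left);
        List<string> b = NormalizeLines(right);
        int shared = Math.Min(a.Count, b.Count);
        for (int i = 0; i < shared; i++)
        {
            if (!string.Equals(a[i], b[i], StringComparison.Ordinal))
                return i;
        }
        // One text has extra lines after the shared prefix.
        return a.Count == b.Count ? -1 : shared;
    }
}
```

The two input strings could be, for example, the contents of `outputWord.md` and `outputHtml.md` read with `File.ReadAllText`. A result of -1 only means the normalized text matches; it does not prove that the conversion preserved everything.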

Yes, I can imagine comparing two documents in the same format, but if an error occurs during the conversion itself, that error cannot be detected this way. I need a control mechanism that confirms with 100% certainty that the content of the Word to HTML conversion matches the original. Can’t you think of a way to achieve this?

@benestom Unfortunately, if the documents being compared are in different formats, there is no guarantee that the comparison result will be correct.
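
Within that limitation, a coarse text-only check is still possible. The sketch below counts words on each side and reports the words whose counts differ, which makes it insensitive to Markdown or layout syntax. `WordInventory`, `Count` and `Delta` are hypothetical names for this illustration, not an Aspose API. Treating a "word" as a run of letters or digits is an assumption. The check cannot detect reordered text, and link URLs or image paths in the input would be counted as words too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not an Aspose API.
public static class WordInventory
{
    // Counts every run of letters/digits, ignoring case and all other characters.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (Match m in Regex.Matches(text, @"[\p{L}\p{N}]+"))
        {
            string word = m.Value.ToLowerInvariant();
            counts[word] = counts.TryGetValue(word, out int n) ? n + 1 : 1;
        }
        return counts;
    }

    // Maps each word whose counts differ to (count in left) - (count in right).
    // An empty result means both texts contain the same words the same number of times.
    public static Dictionary<string, int> Delta(string left, string right)
    {
        Dictionary<string, int> a = Count(left);
        Dictionary<string, int> b = Count(right);
        var delta = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (string word in a.Keys.Union(b.Keys))
        {
            a.TryGetValue(word, out int inLeft);   // stays 0 when the word is absent
            b.TryGetValue(word, out int inRight);
            if (inLeft != inRight)
                delta[word] = inLeft - inRight;
        }
        return delta;
    }
}
```

A positive value in the result means the word occurs more often in the first text, a negative value means it occurs more often in the second. An empty result is a sign that no text was dropped or duplicated, not a proof that the two documents are equivalent.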