Clean Up HTML Code after Word-to-HTML Conversion

Good Morning,

I’m using Aspose.Words for .NET in a project and so far so good. I’m able to convert a Word document to HTML and present the resulting HTML to the user in an online editor for further editing before saving it to a database. However, I’d like to clean up the converted HTML a little before the user gets it. That is, I’d like to strip the HTML down to its simplest form by removing all the extra markup the conversion creates. I only want <p> tags, <ul> and <li> tags, <h1> to <h6> tags, <strong> tags, etc. No <span>, <meta> or <div> tags, and no styling info.

I’ve come across some C# code by others that uses regular expressions (regex) to clean up the Word HTML, but it’s not complete, and regular expressions are somewhat painful to use, at least for me.

So my question is: does Aspose.Words for .NET have any built-in HTML cleanup functionality that could be applied as part of the Word-to-HTML conversion? If not, does Aspose know of a routine or technique for cleaning HTML that has been converted from a Word document, usable from .NET 2.0 C#?

Thanks for any help or insight,

Hi

Thanks for your inquiry. Aspose.Words has no built-in way to “clean up” HTML. However, exporting the CSS into a separate file (see the CssStyleSheetType property of HtmlSaveOptions) could simplify the process:
https://reference.aspose.com/words/net/aspose.words.saving/htmlsaveoptions/cssstylesheettype/
In this case, to clean the HTML you will just need to remove the class attribute from the HTML elements.
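For illustration only, here is a rough sketch of the kind of cleanup described above, done with a parser instead of regular expressions. It is not an Aspose.Words feature and not C#: it uses Python's standard html.parser module just to show the idea, and the names ALLOWED_TAGS, DROP_WITH_CONTENT, WhitelistCleaner and clean_html are made up for this example. It keeps only a whitelist of tags, drops every attribute (class and style included), and discards the contents of head, style, script and title. The same approach should port to .NET with whatever HTML parser you prefer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for this sketch: tags to keep (always without attributes).
ALLOWED_TAGS = {"p", "ul", "ol", "li", "strong", "em",
                "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose text content should be discarded along with the tag.
DROP_WITH_CONTENT = {"head", "style", "script", "title"}


class WhitelistCleaner(HTMLParser):
    """Rebuilds the markup, keeping only whitelisted tags and plain text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes (class, style, ...) are intentionally not copied.
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text arrives unescaped, so escape it again for output.
            self.parts.append(escape(data, quote=False))


def clean_html(html):
    """Return html reduced to whitelisted tags and text."""
    cleaner = WhitelistCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Calling clean_html on the exported HTML string returns the stripped-down markup. Tags outside the whitelist (span, div, meta and so on) are removed but the text inside them is kept, so adjust the two sets to match what your editor should accept.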
Hope this helps.
Best regards,