Get cleaner HTML when converting from Word

Can anybody suggest which options are available in Aspose.Words to output cleaner HTML when converting documents to HTML?
Thanks

Hi
Thanks for your request. Could you please clarify what you mean when you say “cleaner HTML”? There are a few options that allow you to control how styles are output to HTML:
https://reference.aspose.com/words/net/aspose.words.saving/cssstylesheettype/
You can try using Embedded or External options to get cleaner HTML.
Also, the HTML produced by Aspose.Words sometimes contains too many SPAN tags. This can occur because text in the input document consists of multiple Runs, which usually happens when a document has been edited many times in MS Word.
There is a JoinRunsWithSameFormatting method, which concatenates adjacent runs that have the same formatting. So you can try calling this method before saving the document as HTML.
https://reference.aspose.com/words/net/aspose.words/document/joinrunswithsameformatting/
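As a rough illustration of the two suggestions above, the following sketch joins runs and then saves with embedded CSS. The file names are placeholders, and it assumes a version of Aspose.Words for .NET that includes JoinRunsWithSameFormatting.

```csharp
using Aspose.Words;
using Aspose.Words.Saving;

class CleanHtmlExample
{
    static void Main()
    {
        // "input.docx" and "output.html" are placeholder paths.
        Document doc = new Document("input.docx");

        // Merge adjacent runs with identical formatting so that
        // fewer <span> elements are written to the output.
        int joinedRuns = doc.JoinRunsWithSameFormatting();

        // Write CSS into a <style> block in the HTML file rather than
        // repeating it in the style attribute of every element.
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.CssStyleSheetType = CssStyleSheetType.Embedded;

        doc.Save("output.html", options);
    }
}
```

Using CssStyleSheetType.External instead would write the styles to a separate CSS file next to the HTML.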
If you would like to export HTML without styles, then unfortunately there is currently no such option. We will consider adding one in the future.
Best regards.

Hi,
Thanks for the quick reply. This is exactly what we were looking for: a way to reduce the size of the HTML output. Your suggestions helped a lot. Are there any other ways we can achieve this?
Thanks.

Hi
Thanks for your request. Unfortunately, there are currently no other ways to reduce the size of the output HTML.
Best regards,

Hi there,

Thanks for your inquiry.

Just a suggestion, but if you are looking for very simple HTML output then you may be able to achieve this manually by writing your own exporter using DocumentVisitor which walks over the document nodes and writes the HTML.

I think such a technique would not be too hard to implement. Please see the following article for a demonstration: https://reference.aspose.com/words/net/aspose.words/documentvisitor/
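To give an idea of the shape of such an exporter, here is a minimal sketch. The class name SimpleHtmlVisitor is hypothetical, and it only writes each paragraph as a plain <p> element; tables, images, lists, and all formatting are ignored, so treat it as a starting point rather than a complete converter.

```csharp
using System.Net;
using System.Text;
using Aspose.Words;

// Hypothetical minimal exporter: writes every paragraph as <p>text</p>.
class SimpleHtmlVisitor : DocumentVisitor
{
    private readonly StringBuilder _html = new StringBuilder();

    // The HTML collected so far.
    public string Html => _html.ToString();

    public override VisitorAction VisitParagraphStart(Paragraph paragraph)
    {
        _html.Append("<p>");
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRun(Run run)
    {
        // Escape characters such as < and & so the text is valid HTML.
        _html.Append(WebUtility.HtmlEncode(run.Text));
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        _html.AppendLine("</p>");
        return VisitorAction.Continue;
    }
}
```

It would be used by passing an instance to the document, for example `doc.Accept(visitor);`, and then reading `visitor.Html`. Note that this visits every paragraph in the document, including those in headers and footers.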

Thanks,

Thanks for the reply. I have set the CssStyleSheetType property to CssStyleSheetType.Embedded, and it reduced the file size a lot. But I can still see some inline styles on paragraph and span elements. Why does that happen? Is there any way we can move those styles into the embedded CSS?

Hi
Thanks for your request. This is expected. Some formatting in an MS Word document is applied directly to elements; in that case the formatting is written to HTML as inline styles. For example, text formatting in MS Word documents can be defined on a few different levels:

  1. Paragraph style defined for the particular paragraph;
  2. Character style defined for the particular run;
  3. Explicit formatting specified for a particular run.

Explicit formatting (level 3) is not part of any named style, which is why it ends up in inline styles.

Best regards,

Hi,

I’m trying to do the same thing. The links are no longer active. Can you point me to the exact code or to new links?
Thank you,
Melissa
DocuMed

Hi Melissa,

Thanks for your inquiry. Please find the required information at the following places:

https://reference.aspose.com/words/net/aspose.words.saving/cssstylesheettype/
https://reference.aspose.com/words/net/aspose.words/document/joinrunswithsameformatting/

Please let us know if you have any troubles and we will be glad to look into this further for you.

Best regards,

How can I get the HTML and specify the CSS style format? Below is the code I’m currently using to get the HTML.
Thank you.

using (MemoryStream stream = new MemoryStream())
{
    docClone.Save(stream, SaveFormat.Html);
    html = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int)stream.Length);
}

Hi Melissa,

Thanks for your inquiry. You can specify how CSS styles are exported to HTML by using the HtmlSaveOptions.CssStyleSheetType property. For example, if you specify the ‘Inline’ value, then the CSS styles are written inline (as a value of the style attribute on every element).
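Applied to the stream-based code above, this could look roughly like the sketch below. The helper name ToHtml is hypothetical, and the choice of Embedded is only an example; Inline or External can be substituted. Image handling is not covered here.

```csharp
using System.IO;
using System.Text;
using Aspose.Words;
using Aspose.Words.Saving;

static class HtmlExport
{
    // Hypothetical helper: returns the document as an HTML string.
    public static string ToHtml(Document doc)
    {
        HtmlSaveOptions options = new HtmlSaveOptions();

        // Controls where CSS is written: Inline, Embedded, or External.
        options.CssStyleSheetType = CssStyleSheetType.Embedded;

        // Set the output encoding explicitly so it matches the decoding below
        // (UTF-8 without a byte order mark).
        options.Encoding = new UTF8Encoding(false);

        using (MemoryStream stream = new MemoryStream())
        {
            doc.Save(stream, options);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

With this in place, the original snippet reduces to `html = HtmlExport.ToHtml(docClone);`.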

Best regards,