Doc to HTML - Clean Up Unnecessary Spans

I am sure this is due to how messily Word document content is stored by Microsoft. I have often noticed unnecessary spans in the generated HTML. Quite often a span ends mid-word and a new one with identical formatting begins mid-word. For example…

<font face="Garamond">ASPOSE.WORDS</font>

Would look like:

<span style="color:#ff0000;font-style:italic;font-size:12pt;font-family:Garamond;">AS</span><span style="color:#ff0000;font-style:italic;font-size:12pt;font-family:Garamond;">POSE.WO</span><span style="color:#ff0000;font-style:italic;font-size:12pt;font-family:Garamond;">RDS</span>

The above is an example of two unnecessary spans with long definitions. I went through some documents and cleaned them up. It dramatically reduced the size of the generated HTML.

I suggest that Aspose.Words check, at every "</span><span" boundary, whether the span being opened has exactly the same attributes as the one just closed, and merge the two if so. The above HTML would then look much cleaner, like this:

<span style="color:#ff0000;font-style:italic;font-size:12pt;font-family:Garamond;">ASPOSE.WORDS</span>
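For anyone who wants to do this cleanup on their own side in the meantime, here is a minimal sketch of the idea as a post-processing step over the exported HTML. It is written in Python with plain regular expressions and does not use any Aspose.Words API; the function name `merge_adjacent_spans` is made up for this illustration. It assumes the spans are directly adjacent (no whitespace or line breaks between them), contain no nested tags, and have byte-for-byte identical attributes.

```python
import re

# One simple span (no nested tags) immediately followed by the opening tag
# of another span whose attribute string is exactly the same (\1).
ADJACENT_SAME_SPANS = re.compile(r'<span\b([^>]*)>([^<]*)</span><span\b\1>')

def merge_adjacent_spans(html: str) -> str:
    """Merge directly adjacent <span> elements that have identical attributes."""
    while True:
        # Drop the "</span><span ...>" boundary, keeping the first opening tag and its text.
        merged = ADJACENT_SAME_SPANS.sub(r'<span\1>\2', html)
        if merged == html:
            # Nothing left to merge.
            return merged
        html = merged

style = 'color:#ff0000;font-style:italic;font-size:12pt;font-family:Garamond;'
fragmented = (
    f'<span style="{style}">AS</span>'
    f'<span style="{style}">POSE.WO</span>'
    f'<span style="{style}">RDS</span>'
)
print(merge_adjacent_spans(fragmented))
```

The loop repeats the substitution until nothing changes, because a single pass only joins spans pairwise. Spans whose attributes differ in any way, or that are separated by other markup or whitespace, are deliberately left alone.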

Hi,
Thanks for your request. This can occur because the text in your document consists of multiple Runs. Usually this happens when you edit a document multiple times in MS Word.
There is a JoinRunsWithSameFormatting method, which concatenates runs with the same formatting. So you can try calling this method before saving the document as HTML.
https://reference.aspose.com/words/net/aspose.words/document/joinrunswithsameformatting/
Best regards.

Thanks, I will try it out.

I just tried it and it doesn’t seem to work perfectly. I have uploaded an HTML file created using the following options:

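' Load the source document.
Dim doc As New Document(inMessage)
' Strip macros and accept tracked changes so only the final text is exported.
doc.RemoveMacros()
doc.AcceptAllRevisions()
' Write exported images to the local temp folder and reference them as /temp/ in the HTML.
doc.SaveOptions.ExportImagesFolder = Server.MapPath(".") & "\temp\"
doc.SaveOptions.HtmlExportImagesFolderAlias = "/temp/"
doc.SaveOptions.ExportPrettyFormat = True
' Merge adjacent runs with identical formatting before exporting.
doc.JoinRunsWithSameFormatting()
doc.Save(dstStream, SaveFormat.Html)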

As you can see in the result file, there are still a few cases where adjacent spans have exactly the same attributes. We will likely resolve this on our end, but it would be nice to have it work without the additional code.

I would like to suggest an improvement (if it doesn’t already exist). Apparently, when the formatting of text gets changed, the spaces between the words often keep the old formatting. Here is an example.
The text below should be unformatted (or formatted as its container is):
The Commerce Department also reported that incomes rose 0.2% in September
Here is how it is generated with Aspose.Words HTML conversion:
The Commerce Department

also

reported

that incomes rose 0.2% in September
The problem is that at some point certain text was italicized. Later the italics were removed word by word, leaving the spaces between the words italic.
I would like to suggest a function that converts formatted single spaces into unformatted single spaces. This is something we can do ourselves, but we would prefer to have it as a function of Aspose…
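To illustrate what such a function might do, here is a rough Python sketch that works on the exported HTML rather than on the document model. It is not an Aspose.Words feature; the function name `unformat_single_spaces` is hypothetical, and the markup in the usage example (an italic `<span>` wrapping a single space) is only an assumption about how the formatted spaces appear in the output.

```python
import re

# An inline element (<span>, <i> or <em>, with any attributes) whose entire
# content is exactly one plain space. Which tags the exporter really emits
# for a formatted space is an assumption here.
FORMATTED_SPACE = re.compile(r'<(span|i|em)\b[^>]*> </\1>')

def unformat_single_spaces(html: str) -> str:
    """Replace inline elements that wrap a single space with a bare space."""
    return FORMATTED_SPACE.sub(' ', html)

sample = (
    'The Commerce Department<span style="font-style:italic;"> </span>also'
    '<span style="font-style:italic;"> </span>reported'
    '<span style="font-style:italic;"> </span>that incomes rose 0.2% in September'
)
print(unformat_single_spaces(sample))
```

One thing to keep in mind: stripping the wrapper is harmless for italics, but it would visibly change the result if the space carried an underline or a background color, so a real implementation would probably want to limit itself to formatting that cannot be seen on a space.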

Hi

Thank you for the additional information. Yes, you are right; sometimes JoinRunsWithSameFormatting does not help. This is because Run nodes in MS Word can differ in attributes that are not exported to HTML, so runs that cannot be joined still produce identical spans. We therefore need to join SPANs with the same formatting instead of joining Runs. Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.
Also, I would like to thank you for your suggestion. I think it is useful and reasonable.
Best regards.

The issues you have found earlier (filed as 4773) have been fixed in this update.
