HTML conversion splits word into two parts using SPAN

rkeller · June 9, 2010, 12:02pm

Hi. I am having a problem converting .doc to .html with Aspose.Words. I have attached two Word documents. Each contains the word “SUBJECT”, and the two documents look identical to the user. When File1.doc is converted, I get the following result:

SUBJECT

This result splits the HTML into ‘SUBJEC’ and ‘T’ using two SPANs.

File2 produces what I think is a correct result using just one SPAN:

SUBJECT

I looked for special characters inside the word “SUBJECT” in both Word documents and didn’t find any. Both documents render properly, but I am trying to parse the HTML for the word “SUBJECT”, and that is harder with the first conversion.

Any ideas what is happening?

alexey.noskov · June 9, 2010, 12:54pm

Hi,
Thanks for your request. This can occur because the text in your document consists of multiple Runs. This usually happens when a document is edited multiple times in MS Word.
There is a JoinRunsWithSameFormatting method, which concatenates adjacent runs that have the same formatting. You can try calling this method before saving the document as HTML.
https://reference.aspose.com/words/net/aspose.words/document/joinrunswithsameformatting/
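To illustrate the idea, here is a minimal sketch in plain Python. It is a toy model only, not the Aspose.Words object model or its implementation: a run is assumed to be a `(text, formatting)` pair, the helper names `join_runs` and `to_html` are hypothetical, and the `font-weight:bold` style is made up for the example. It shows why two runs with identical formatting come out as two SPANs, and how merging adjacent runs first yields a single SPAN.

```python
import html
from itertools import groupby

# Toy model: each run is a (text, formatting) tuple.
# This is an illustration only, not how Aspose.Words stores runs.

def join_runs(runs):
    """Merge consecutive runs whose formatting is identical."""
    merged = []
    # groupby only groups *adjacent* items with the same key,
    # so runs separated by a differently formatted run stay apart.
    for fmt, group in groupby(runs, key=lambda run: run[1]):
        merged.append(("".join(text for text, _ in group), fmt))
    return merged

def to_html(runs):
    """Render every run as its own SPAN, the way a naive exporter would."""
    return "".join(
        f'<span style="{html.escape(fmt)}">{html.escape(text)}</span>'
        for text, fmt in runs
    )

# Hypothetical content of File1.doc: one word stored as two runs.
runs = [("SUBJEC", "font-weight:bold"), ("T", "font-weight:bold")]

split_html = to_html(runs)              # two SPANs
joined_html = to_html(join_runs(runs))  # one SPAN
```

In this sketch, searching `joined_html` for "SUBJECT" succeeds, while the same search on `split_html` fails because the word is broken across two SPANs.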
Best regards.