Inserting HTML produces garbled characters

Dear Sir

When we insert HTML into a Word document using the code below, garbled characters appear. The cause is most likely the following sequence in the content:

“&#”

The C# code is as follows:

// Load the existing document, letting Aspose.Words detect the format.
Document doc = new Document("c:\\temp\\1.docx", new Aspose.Words.Loading.LoadOptions() { LoadFormat = LoadFormat.Auto });

DocumentBuilder db = new DocumentBuilder(doc);
var htmlContent = "LTG tool pouch #902943 / &#902944 / &#902945";
// Move to the first cell of the first table in the document.
db.MoveToCell(0, 0, 0, 0);
db.InsertHtml(htmlContent, HtmlInsertOptions.RemoveLastEmptyParagraph);

doc.Save("c:\\temp\\1_html.docx", SaveFormat.Docx);

@wengyeung The behavior is expected. In HTML, the syntax &# is used to introduce a numeric character reference (NCR), a method for displaying symbols not easily found on a standard keyboard.

The general format for using a numeric character reference is:

  • &#D; where D is the decimal (base-10) integer value corresponding to the character’s Unicode code point.
  • &#xH; where H is the hexadecimal (base-16) integer value for the Unicode code point (note the x after the #).

For example, &#60; (decimal) or &#x3C; (hexadecimal) will be displayed as < (less than sign) in HTML.
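
As a small illustration of the two forms above, the sketch below decodes both references with WebUtility.HtmlDecode from the .NET standard library. The NcrDemo class and its Decode method are hypothetical names used only for this example, and this is not necessarily how Aspose.Words decodes HTML internally.

```csharp
using System;
using System.Net;

// Hypothetical helper class, used only for illustration.
public static class NcrDemo
{
    // Decodes HTML character references with the .NET standard library.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);

    public static void Main()
    {
        // Decimal form: 60 is the code point of the less-than sign.
        Console.WriteLine(Decode("&#60;"));
        // Hexadecimal form of the same code point (0x3C == 60).
        Console.WriteLine(Decode("&#x3C;"));
    }
}
```

Both calls refer to the same code point, so both lines print the less-than sign.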

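In your case, the HTML parser reads &#902944 and &#902945 as numeric character references even though the trailing semicolon is missing. If you put your text into a simple HTML file and open it in a browser, you will see exactly the same result as in the document produced by Aspose.Words: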

<html>
<body>
    <p><span>LTG tool pouch #902943 / &#902944 / &#902945</span></p>
</body>
</html>

If you need to preserve &#902944 and &#902945 as is, insert the content as plain text rather than as HTML:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.Write("LTG tool pouch #902943 / &#902944 / &#902945");
doc.Save(@"C:\Temp\out.docx");

Alternatively, escape the & character in the HTML string as &amp; (including the trailing semicolon):

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
// &amp; is rendered as a literal & so the digits after # are kept as text.
builder.InsertHtml("LTG tool pouch #902943 / &amp;#902944 / &amp;#902945");
doc.Save(@"C:\Temp\out.docx");
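
If the text comes from user input or a database, the escaping can be done programmatically instead of by hand. The sketch below is one possible approach, assuming the whole string is meant as literal text with no real HTML markup in it. It uses WebUtility.HtmlEncode from the .NET standard library. HtmlEscapeDemo and EscapeForInsertHtml are hypothetical names introduced for this example.

```csharp
using System;
using System.Net;

// Hypothetical helper class, used only for illustration.
public static class HtmlEscapeDemo
{
    // Escapes HTML-significant characters (&, <, >, quotes) so the text
    // is shown literally when it is later parsed as HTML.
    // Assumption: the input contains no markup that should stay active.
    public static string EscapeForInsertHtml(string text) => WebUtility.HtmlEncode(text);

    public static void Main()
    {
        string raw = "LTG tool pouch #902943 / &#902944 / &#902945";
        string escaped = EscapeForInsertHtml(raw);
        // Each & becomes &amp; while # and / are left untouched.
        Console.WriteLine(escaped);
        // The escaped string could then be passed to builder.InsertHtml(escaped).
    }
}
```

Note that this escapes every & and < in the string, so it is not suitable for content that mixes literal text with HTML tags that must remain active.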