DOCX to HTML conversion issue with single quote character using C#

We use the Aspose.Words component to convert a Word file into an HTML file. However, the single quote character in the source Word file is changed to a different special character in the target HTML file. This converted character shows up in the browser as a question mark.

The following is our conversion code:

private string GetHTMLFromWord(string path)
{
    Stream wordStream = new MemoryStream(File.ReadAllBytes(path));
    Stream htmlStream = new MemoryStream();
    Aspose.Words.Document document = new Aspose.Words.Document(wordStream);
    document.RemoveMacros();
    document.Save(htmlStream, Aspose.Words.SaveFormat.Html);
    htmlStream.Position = 0;
    StreamReader sr = new StreamReader(htmlStream);
    return sr.ReadToEnd();
}

Attached please find my source Test.docx file and the generated Test.html file (please rename it, because the html extension is not allowed as an attachment here).

We are using Aspose.Words 9.4.0.0 and our development environment is Visual Studio 2010 in Windows 7.

Additional information
I just upgraded our Aspose.Words component to 10.1.0.0, but the issue still remains.

Hi Henry,

Thanks for your query. I have not found any issue with the output HTML while using the latest version of Aspose.Words for .NET, 11.3.0. Please use the latest version. I have attached the output HTML with this post.

Please let us know if you have any more queries.

Hi Henry,

Thanks for your inquiry.

Tahir is correct, this issue does not occur in the latest version of Aspose.Words. Most likely the character you are seeing in the output is the Byte Order Mark (BOM). This is used to help readers detect which encoding is used in the input text.

In the latest version we have changed the default settings in HtmlSaveOptions so that the default encoding does not export a BOM. You can achieve the same manually in older versions by using the following code:

HtmlSaveOptions options = new HtmlSaveOptions(SaveFormat.Html);
options.Encoding = new UTF8Encoding(false);
doc.Save(stream, options);
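
To illustrate what the `false` argument changes, here is a minimal sketch that uses only standard .NET classes. The names `BomCheck` and `PreambleHex` are made up for this illustration and are not part of Aspose.Words.

```csharp
using System;
using System.Text;

public static class BomCheck
{
    // Returns the byte order mark (preamble) that a UTF-8 encoding
    // would write at the start of a stream, as a hex string.
    public static string PreambleHex(bool emitBom)
    {
        Encoding encoding = new UTF8Encoding(emitBom);
        return BitConverter.ToString(encoding.GetPreamble());
    }
}
```

`BomCheck.PreambleHex(true)` gives `EF-BB-BF` (the UTF-8 form of U+FEFF), while `BomCheck.PreambleHex(false)` gives an empty string. Either way, the BOM can only appear at the very start of the output, not in the middle of the text.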

Please let us know if there are any other queries we can help with.

Thanks,

Thank you very much for reviewing my issue. However, none of the solutions provided work.
OK, let me explain what I have tried so far.

  1. I first tried Tahir’s suggestion. I downloaded the latest Aspose.Words for .NET 11.3.0 and used it in our application (Fig 1). The result of the conversion is still the same.
  2. I then tried Adam’s suggestion. I changed my code to this:
private string GetHTMLFromWord(string path)
{
    Stream wordStream = new MemoryStream(File.ReadAllBytes(path));
    Stream htmlStream = new MemoryStream();
    Aspose.Words.Document document = new Aspose.Words.Document(wordStream);
    document.RemoveMacros();
    HtmlSaveOptions options = new HtmlSaveOptions(Aspose.Words.SaveFormat.Html);
    options.Encoding = new UTF8Encoding(false);
    document.Save(htmlStream, options);
    htmlStream.Position = 0;
    StreamReader sr = new StreamReader(htmlStream);
    return sr.ReadToEnd();
}

Again, the result is still the same.

  3. I loaded the file Tahir generated into Notepad++. The interesting thing I found is that this file also has the single quote problem. The single quote character in English is code 0x27 (Fig2), but the single quote character in that file is 0xe280 (Fig3). If I open Tahir’s file in IE, the single quote appears to show up correctly (Fig4), but when I copy the result into Notepad++, I can see the single quote has changed to another character, 0x92 (Fig5).
  4. I read through the explanation of the BOM (byte order mark) on Wikipedia, and I don’t think the incorrect single quote character is a BOM.

Quote:
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream.
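
The hex values above are consistent with the character being the typographic right single quotation mark, U+2019, rather than the plain apostrophe U+0027. That is an inference from the bytes reported here, not something confirmed from the attached files. A minimal sketch to check this with standard .NET classes follows; `QuoteBytes` and `Hex` are names invented for the illustration.

```csharp
using System;
using System.Text;

public static class QuoteBytes
{
    // Hex dump of the bytes a string becomes in the given encoding.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

`QuoteBytes.Hex("'", Encoding.UTF8)` gives `27`, and `QuoteBytes.Hex("\u2019", Encoding.UTF8)` gives `E2-80-99`, which starts with the 0xE2 0x80 seen in the hex view. `QuoteBytes.Hex("\u2019", Encoding.ASCII)` gives `3F`, the question mark, because ASCII has no slot for this character. In the Windows-1252 code page the same character is the single byte 0x92.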

Hi Henry,

Please accept my apology for the late response.

I have copied the contents of the output HTML into Notepad++, Notepad and MS Word and have not found the issue with the single quote. I have also opened the HTML output in Notepad++ and have not found the issue. Please see the attached images.

I have Notepad++ version 6.1.2 installed on my side. It would be great if you could share your complete scenario so that we can reproduce the same issue at our end.

I don’t think any additional information is missing from my end. According to your screenshot, you only viewed the resulting HTML in Notepad++ in text mode; you should install the Hex plugin to view it in hex mode. The problem is not that the “single quote” displays incorrectly in text mode in Notepad++. The problem is that the single quote is not a standard single quote (this can be verified in hex mode; please check the screenshot in my previous post, which I have also attached again here), and therefore my subsequent processing of the generated HTML chokes on the “single quote”. Thanks.

Hi Henry,

Thanks for sharing the information. This is not an issue. Please find the MS Word and Aspose output in the attachment, both the hex and the HTML output. Aspose.Words mimics the same behavior as MS Word.

Please let us know if you have any more queries.

Thank you very much for all your help. Could you tell me what encoding is used for the single quote character in the output, so that we can manually replace it? Thanks.
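
A manual replacement along these lines could look like the sketch below. It assumes the character is the right single quotation mark U+2019 (which the 0xE2 0x80 and 0x92 values mentioned earlier suggest) and also covers the other common typographic quotes. `QuoteNormalizer` and its methods are hypothetical helpers, not Aspose.Words API.

```csharp
public static class QuoteNormalizer
{
    // Replaces typographic quotes with their plain ASCII equivalents.
    public static string ToAsciiQuotes(string html)
    {
        return html
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\'')  // right single quotation mark
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"');  // right double quotation mark
    }

    // Keeps the typographic quotes but writes them as numeric
    // character references, which are pure ASCII in the markup.
    public static string ToEntities(string html)
    {
        return html
            .Replace("\u2018", "&#8216;")
            .Replace("\u2019", "&#8217;")
            .Replace("\u201C", "&#8220;")
            .Replace("\u201D", "&#8221;");
    }
}
```

`ToAsciiQuotes` changes the text itself, while `ToEntities` preserves how it looks in the browser. Note that replacing double quotes inside HTML attribute values with `"` could break the markup, so `ToEntities` is probably the safer choice for whole documents.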

Hi Henry,

The single quote looks good in all text editors, such as Notepad, Notepad++ and MS Word. Unfortunately, I have not completely understood your query. It would be great if you could share some more detail about your query related to encoding.

As I said before, the HTML generated by the Aspose.Words component is consumed by our application, which adds more content, and is eventually wrapped in an ASP.NET LiteralControl and rendered to the end user’s browser. Because of the unusual encoding of the single quote, the final page shows it as a question mark on screen. This is why I am trying to figure out why this happens and how to remedy it.
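
One possible workaround is sketched here under the assumption that the question mark appears because some step between the conversion and the browser uses an encoding that cannot represent the character: before handing the HTML to the LiteralControl, turn every non-ASCII character into a numeric character reference. `HtmlAsciiSafe.EscapeNonAscii` is a hypothetical helper written for this illustration.

```csharp
using System.Text;

public static class HtmlAsciiSafe
{
    // Rewrites every character above U+007F as a numeric character
    // reference so the markup survives any ASCII-compatible encoding.
    public static string EscapeNonAscii(string html)
    {
        StringBuilder sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c < 128)
            {
                sb.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < html.Length
                     && char.IsLowSurrogate(html[i + 1]))
            {
                // Characters outside the BMP are stored as a surrogate pair.
                sb.Append("&#").Append(char.ConvertToUtf32(c, html[i + 1])).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

With this, `HtmlAsciiSafe.EscapeNonAscii("This is Henry\u2019s test")` gives `This is Henry&#8217;s test`, which a browser renders as the typographic quote regardless of the page's response encoding. The string must already have been decoded correctly (as UTF-8) when it was read from the stream for this to help.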

Hi Henry,

Thanks for sharing the information. Please use the following code snippet to get the correct single quote. I hope this helps. Let us know if you have any more queries.

Document doc = new Document(MyDir + "Test.docx");
MemoryStream stream = new MemoryStream();
doc.Save(stream, SaveFormat.Html);
UTF8Encoding enc = new UTF8Encoding();
string strOutPut = enc.GetString(stream.ToArray());

The following is the value of strOutPut:

This is Henry’s test

Thank you very much for all your help.