Hi Aspose Team,
when using Aspose.Words to convert HTML files to documents, I came across the following scenario. The HTML content is loaded from a file into a string and is converted to a memory stream containing the bytes of that string as UTF-8. The HTML contains a special character in two versions: as an HTML character code and as a literal special character from the code page. The second version causes the trouble.
That memory stream is then used to create a document. When saving that document, the special characters of the HTML file are interpreted incorrectly. It seems Aspose is using another code page internally?
This worked fine up to v11.1.0; the issue first occurs with v11.2.0. It does work when using UTF-7 or the system's default code page.
This thread is less about reporting a bug than about asking what changed. Are we on the safe side if we use the system code page defined by the Windows language settings when preparing the bytes for the Aspose document load process?
The code used is below, and the HTML file is attached.
Thanks for the support,
Oliver
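// Read the file bytes and decode them with the system default code page.
string htmlContent = Encoding.Default.GetString(File.ReadAllBytes(filename));
// Re-encode the HTML as UTF-8 and load it from a stream.
using (MemoryStream mIn = new MemoryStream(Encoding.UTF8.GetBytes(htmlContent)))
{
    Document document = new Document(mIn);
    document.Save("c:/myfile.docx");
}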
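A minimal sketch of the symptom described in this post, using only System.Text. It assumes, as the reply further down the thread states, that HTML without a declared encoding is read as Latin-1. The class and method names are hypothetical and are not part of Aspose.Words.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes the text as UTF-8 and decodes the same bytes as ISO-8859-1,
    // mimicking a reader that assumes Latin-1 for input with no declared encoding.
    public static string ReadUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

For example, `EncodingMismatchDemo.ReadUtf8AsLatin1("\u00E4")` returns the two characters U+00C3 and U+00A4 instead of the single character U+00E4. Characters written as HTML character codes are plain ASCII and are unaffected by the mismatch.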
Hi Oliver,
Thanks for your inquiry. I would like to refer you to the following post where Andrey has shed some light on the changes made in Aspose.Words v11.2.0.
https://forum.aspose.com/t/61227
Please let me know if I can be of any further assistance.
Best Regards,
Hi Awais,
thanks for answering. However, I can't access that page due to permission restrictions.
I have one other question about the new version. The Aspose.Words .NET assembly now references the NUnit framework (nunit.framework, Version=2.5.10.11092). Is it by accident that the release build contains unit tests and references such assemblies? Will that be removed again in a future version? Not that it causes any issues, but it would feel better if you cleaned that up, so that there is no potential runtime exception if an NUnit class is called for whatever reason within Aspose.Words.
Thanks,
Oliver
Hi Oliver,
Thanks for your inquiry. Here is the quote from the link mentioned in my previous post.
Andrey Soldatov: In the last release we changed the mechanism of encoding detection for HTML.
Detection has become much smarter but we still analyse BOM at the beginning and an encoding mentioned in special Html and Xml tags inside Html.
We don’t perform syntax analysis of human-readable text in Html.
Unfortunately, your Html files don’t use any standard way to denote that they are UTF-8. For such files we use Encoding.GetEncoding(CodePage.WindowsLatin1CodePage) default encoding.
Secondly, yes, there was a bug in our release process which meant a reference to NUnit was left in the compiled DLLs. We fixed this issue and replaced the download with a proper version a while ago, so you must have downloaded the DLLs before that. There shouldn’t be any reference to NUnit now through any public API. Could you please re-download version 11.2.0 of Aspose.Words from the following link and let us know how it goes on your side?
I hope this helps.
Best Regards,
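The quote above says that a BOM at the start of the input is still analysed. One way to mark a stream as UTF-8 is therefore to prepend the UTF-8 byte order mark before the HTML bytes. The following is a sketch only: `HtmlStreamHelper` and `ToUtf8StreamWithBom` are hypothetical names, and whether a given Aspose.Words version honours the BOM should be verified against that version.

```csharp
using System;
using System.IO;
using System.Text;

public static class HtmlStreamHelper
{
    // Hypothetical helper: returns a stream holding the UTF-8 BOM (EF BB BF)
    // followed by the UTF-8 bytes of the HTML, positioned at the start.
    public static MemoryStream ToUtf8StreamWithBom(string html)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(html);

        var stream = new MemoryStream(bom.Length + body.Length);
        stream.Write(bom, 0, bom.Length);
        stream.Write(body, 0, body.Length);
        stream.Position = 0; // rewind so the consumer reads from the BOM
        return stream;
    }
}
```

The resulting stream could be passed to `new Document(stream)` in the same way as in the first post. Declaring the encoding inside the HTML itself, for example with a `meta` charset tag, would be the other route the quote mentions.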
Hi Awais,
thanks for the info. I can confirm that the NUnit dependency is gone after a fresh download of the assemblies. All questions answered.
Thanks,
Oliver
Hi Oliver,
Just a quick query: how did you fix the issue you were having? Did you have to adjust your input HTML?
Thanks,
Hi Adam,
the HTML bytes are decoded to a string using UTF-8 (for further processing that does not depend on the code page). The string is then encoded to Latin-1 bytes. No adjustments are made to the HTML code itself.
Something like this:
public static string DoSomething(byte[] data)
{
    // Decode the incoming HTML bytes as UTF-8.
    string htmlContent = Encoding.UTF8.GetString(data);
    Encoding enc = Encoding.Default;
    try
    {
        // Load the Latin-1 encoding; the system default is kept if this fails.
        enc = Encoding.GetEncoding("ISO-8859-1");
    }
    catch (ArgumentException) { /* handle */ }
    // Re-encode as Latin-1 bytes and return them Base64-encoded.
    return Convert.ToBase64String(enc.GetBytes(htmlContent));
}
Kind regards,
Oliver
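One caveat with re-encoding to Latin-1 as described above: characters that ISO-8859-1 cannot represent are silently replaced by the default encoder fallback (typically with "?"). The sketch below shows one way to detect that case up front by requesting the encoding with exception fallbacks. `Latin1Check` and `FitsInLatin1` are hypothetical names and are not part of the workaround posted above.

```csharp
using System;
using System.Text;

public static class Latin1Check
{
    // Returns true if every character of the text can be encoded as ISO-8859-1.
    public static bool FitsInLatin1(string text)
    {
        // Exception fallbacks make the encoder throw instead of substituting characters.
        Encoding strictLatin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictLatin1.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For instance, `Latin1Check.FitsInLatin1("\u00E4")` is true, while `Latin1Check.FitsInLatin1("\u20AC")` (the euro sign) is false, so HTML containing such characters would need them written as HTML character codes or a different approach to declaring the encoding.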
Hi Oliver,
Thanks for clarifying that with us.
It’s great everything is working as expected. Please feel free to ask any time you need any help.
Thanks,