Free Support Forum - aspose.com

Problem with Scandinavian characters with new Aspose.Words

Hey,

I purchased the new Aspose.Words with a test licence to see what's new and how Aspose.Words converts docx files to html files. We've been using Aspose.Words major version 3, but now we need a component that can convert MS Office Word 2007 files to html files. In testing I came across some unexpected behaviour. Aspose.Words didn't seem to recognize Scandinavian characters: umlauts like ä and ö were converted to some special character. The earlier version of Aspose.Words recognizes these characters fine when converting a doc file to an html file. Is the problem in Aspose.Words, or does my Word save docx files in a format that Aspose.Words can't understand? Or do you even support Scandinavian characters anymore? I wrote an ordinary docx file with just plain text in Times New Roman font and tried to convert it to an html file.

Matti Jantti, Finland

Hello Matti!

Thank you for your inquiry and for suggesting upgrade.

We don’t drop support for anything once it is supported. The problem may be specific to the DOCX format; in that case re-saving the document in DOC format with MS Word could help. But I understand that you need to work with DOCX specifically. Could you attach the problematic document here and show the code snippet you are using for the conversion? We’ll investigate these materials and provide you with more information.

Regards,

Hey,

Here's the file and the code I used for the test conversion. The file was saved with Office 2003 in the docx format; can this affect the conversion somehow? Should I test it with a file created with Office 2007?

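' Load the source document (SaveToDir and SaveAsName are set elsewhere)
Dim WordDoku As Aspose.Words.Document
WordDoku = New Aspose.Words.Document(SaveToDir & SaveAsName)

' Save it as HTML next to the original
WordDoku.Save(SaveToDir & FileNameWithoutExtension & ".htm", Aspose.Words.SaveFormat.Html)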

Matti Jantti, Finland

Thank you for your materials.

I have tried the conversion and everything is okay on my side. Your code is also correct. I’m attaching my results. Please see whether you get the same. There are two files in the attachment: HTML with inline styles and with embedded CSS. Both give correct rendering in my Avant and IE7.

Please let me know if you see any differences in your results. If the results are the same, then it is probably an environmental issue that needs to be solved by adjusting browser settings.

Regards,

Hey,

Ok. Thanks for the quick answer. I have to check my development environment to see if that's the reason for this misbehaviour. Is it possible that settings on the Windows Server can affect the conversion?

Hey,

For some reason the conversion of this file doesn't seem to work on my development Windows Server. I tried converting the same document on my two development workstations and on another server, and there the conversion is OK.

Hello!

If so, then it is an environmental issue. I can't guess the exact cause. Are there any problems with other documents with umlauts on that computer? Have you tried opening the files I attached? Aspose.Words outputs HTML in UTF-8 encoding, and Aspose.Words itself doesn’t depend on the version of the operating system. So I expect that any well-formed document with umlauts in UTF-8 encoding won’t be rendered correctly on that computer. Maybe no fonts with umlauts are installed there, or some browser settings should be checked.

Regards,

Hey,

Yes, the problem is with the UTF-8 character set that Aspose.Words outputs. If I change it in the converted html to, for example, ISO 8859-1, the umlauts are rendered as expected. Is there some property or enumeration that can be used to change the character set in the conversion? I've read the API, but cannot find anything like that.

Hello Matti!

Thank you for your investigation. Unfortunately there is currently no way to set the output encoding in Aspose.Words. There is a known issue in our defect database:

#1552 - Allow to set the encoding for HTML export in SaveOptions.

As a workaround you can re-encode the output before using it. The .NET IO classes let you select an encoding. Basically, you open two streams, one on the input file and one on the output file, set the output encoding, and then copy the file line by line. Alternatively, you can find a ready-made utility for that purpose, or fix that environment so that re-encoding isn't needed at all.
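
The re-encoding workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: ReencodeHtml, the HtmlReencoder module and the file names are hypothetical and not part of Aspose.Words, ISO-8859-1 is just an example target, and the replacement of a "charset=utf-8" declaration assumes the exported HTML contains one in that form, which may not match what you actually get.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module HtmlReencoder

    ' Hypothetical helper (not an Aspose.Words API): reads a UTF-8 HTML file
    ' and writes it back out in the requested target encoding.
    Sub ReencodeHtml(ByVal sourcePath As String, ByVal targetPath As String, ByVal targetEncoding As Encoding)
        Using reader As New StreamReader(sourcePath, Encoding.UTF8)
            Using writer As New StreamWriter(targetPath, False, targetEncoding)
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    ' Assumption: the HTML declares its charset as "charset=utf-8".
                    ' Keep the declaration in step with the new byte encoding.
                    line = Regex.Replace(line, "charset=utf-8", _
                                         "charset=" & targetEncoding.WebName, _
                                         RegexOptions.IgnoreCase)
                    writer.WriteLine(line)
                    line = reader.ReadLine()
                End While
            End Using
        End Using
    End Sub

    Sub Main()
        ' Create a small UTF-8 sample so the sketch runs on its own.
        ' ChrW(&HE4) is "ä"; the file names are placeholders.
        Dim sample As String = "<meta http-equiv=""Content-Type"" content=""text/html; charset=utf-8"">" & _
                               Environment.NewLine & "<p>p" & ChrW(&HE4) & ChrW(&HE4) & "</p>"
        File.WriteAllText("sample-utf8.htm", sample, Encoding.UTF8)

        ReencodeHtml("sample-utf8.htm", "sample-latin1.htm", Encoding.GetEncoding("iso-8859-1"))
    End Sub

End Module
```

In your case you would call ReencodeHtml on the .htm file right after WordDoku.Save. Note that characters which do not exist in the target encoding are written as "?", so this only suits documents whose text fits into that character set.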

Regards,

Hey,

Thank you for your time. I had already designed my own solution to work around this problem; there just was no point in using it until I was sure that there's no way to change the encoding in Aspose.Words.

Matti Jantti, Finland

Sorry for the inconvenience of needing a workaround.

May I also ask you to post some comments here if you figure out _why_ this happens with UTF-8 HTML files? This could be interesting for other customers.

Thank you in advance,

The issues you have found earlier (filed as 1552) have been fixed in this update.