UTF-16 problem converting DOC to HTML on WebSphere/Unix with Aspose.Words for Java

Hello
I have a problem using Aspose in a corporate environment.
Imported Word documents convert to HTML correctly from a webapp running on JBoss/Windows XP (development environment). However, when running on WebSphere/Unix (production environment), things go wrong: the output comes back as UTF-16.

Here is the code used :

ByteArrayInputStream stream = new ByteArrayInputStream(byteArray);
Document documentAspose = new Document(stream);
ByteArrayOutputStream out = new ByteArrayOutputStream();

if (format == SaveFormat.HTML)
{
    documentAspose.getSaveOptions().setHtmlExportEncoding(Charset.forName("UTF8"));
}
else if (format == SaveFormat.TEXT)
{
    documentAspose.getSaveOptions().setTxtExportEncoding(Charset.forName("UTF8"));
}

documentAspose.save(out, format);
System.out.println(out);

And here is the wrongly encoded output (first few lines):

ÿþ<�h�t�m�l�>�<�h�e�a�d�>�<�m�e�t�a�
�h�t�t�p�-�e�q�u�i�v�=�"�C�o�n�t�e�n�t�-�T�y�p�e�"�
�c�o�n�t�e�n�t�=�"�t�e�x�t�/�h�t�m�l�;�
�c�h�a�r�s�e�t�=�u�t�f�-�1�6�"�>�<�/�m�e�t�a�>�<�m�e�t�a�
�h�t�t�p�-�e�q�u�i�v�=�"�C�o�n�t�e�n�t�-�S�t�y�l�e�-�T�y�p�e�"�
�c�o�n�t�e�n�t�=�"�t�e�x�t�/�c�s�s�"�>�<�/�m�e�t�a�>

Normally the format should be UTF-8 and should not render those characters.

Do you have any clues ?
Thanks for your kind help.

Hi

Thanks for your request. Could you please attach a sample document here for testing? I will check the issue and provide you with more information.
Have you tried saving HTML to file? Is the produced HTML file also damaged?
Best regards.

Here is a very simple word document used for testing.

The return of importing this file with Websphere6.1/Unix is this :

ÿþ< h t m l > < h e a d > < m e t a h t t p - e q u i v = " C o n t e n t - T y p e " c o n t e n t = " t e x t / h t m l ; c h a r s e t = u t f - 1 6 " > < / m e t a > < m e t a h t t p - e q u i v = " C o n t e n t - S t y l e - T y p e " c o n t e n t = " t e x t / c s s " > < / m e t a > < m e t a n a m e = " g e n e r a t o r " c o n t e n t = " A s p o s e . W o r d s f o r J a v a 3 . 3 . 0 . 0 " / > < t i t l e > < / t i t l e > < / h e a d > < b o d y > < d i v > < p s t y l e = " m a r g i n : 0 p t " > < s p a n s t y l e = " f o n t - f a m i l y : ’ T i m e s N e w R o m a n ’ ; f o n t - s i z e : 1 2 p t " > V e r y S i m p l e D o c u m e n t < / s p a n > < / p > < / d i v > < / b o d y > < / h t m l > 

The problem is the additional "empty" character present between every
normal character. In my logs it renders as a space; in Firefox it
renders as the small icon shown in my previous post.
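
For readers following along, here is a minimal sketch of why that "empty" character shows up. It assumes the bytes are UTF-16 little-endian with a byte order mark (which the leading "ÿþ" suggests) and that something downstream reads them with a single-byte charset. The class and method names are made up for illustration and are not part of Aspose.Words.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Misread {
    // Encode text as UTF-16LE preceded by the byte order mark FF FE.
    // Assumption: this matches the byte layout behind the "ÿþ" prefix above.
    static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = utf16LeWithBom("<html>");

        // Misread: a single-byte charset maps the BOM to two visible characters
        // and every 0x00 high byte to a NUL character between the letters.
        String misread = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(misread.replace('\u0000', '.')); // NULs shown as dots

        // Correct read: the UTF-16 decoder consumes the BOM and restores the text.
        String decoded = new String(bytes, StandardCharsets.UTF_16);
        System.out.println(decoded);
    }
}
```

If this is what is happening, the bytes themselves are valid UTF-16 and only the decoding step is wrong.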

I tested several encodings in the line "documentAspose.getSaveOptions().setHtmlExportEncoding(Charset.forName("ISO-8859-1"));"

but it always results in the charset utf-16 in the HTML in Websphere whereas in JBoss/Windows the charset matches the value given to the method setHtmlExportEncoding.

Thanks for your support.

Hi

Thank you for the additional information. I do not think that Aspose.Words saved the document improperly. I think the stream is being converted to a string improperly. Have you tried specifying the encoding explicitly when converting the stream to a string? See the following code:

Document documentAspose = new Document("C:\\Temp\\VerySimpleDoc.doc");
ByteArrayOutputStream out = new ByteArrayOutputStream();
documentAspose.getSaveOptions().setHtmlExportEncoding(Charset.forName("UTF8"));
documentAspose.save(out, SaveFormat.HTML);
System.out.println(out.toString("UTF8"));
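
The same point can be checked without Aspose.Words at all. The sketch below is only an illustration under that assumption: it fills a ByteArrayOutputStream with known UTF-8 bytes and decodes them with an explicit charset name. The class name and the decode helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamToString {
    // Decode the buffered bytes with an explicit charset name instead of the
    // platform default that the no-argument toString() would use.
    static String decode(ByteArrayOutputStream out, String charsetName)
            throws UnsupportedEncodingException {
        return out.toString(charsetName);
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] html = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        out.write(html, 0, html.length);

        // This value differs between machines, which is why printing the
        // stream directly can look fine on one server and broken on another.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // Decoding with the charset the bytes were written in is stable everywhere.
        System.out.println(decode(out, "UTF-8"));
    }
}
```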

Also, have you tried saving the HTML document to a file? Does the output file also look damaged?

Document documentAspose = new Document("C:\\Temp\\VerySimpleDoc.doc");
documentAspose.save("C:\\Temp\\out.html", SaveFormat.HTML);

Best regards.

Hi

I tried saving the document to a file and it does not look damaged. However, the encoding I specify in

"documentAspose.getSaveOptions().setHtmlExportEncoding(Charset.forName("UTF8"));"

is not taken into account and the charset in the meta tag is still utf-16.
Since the file is encoded into utf-16 it renders fine by itself.
However that does not change my problem on the ByteArrayOutputStream. I think there is a call to String.getBytes() somewhere in your code : this uses the platform encoding instead of the encoding specified in the save options.

Anyway I could manage the situation by using :

Document documentAspose = new Document("C:\\Temp\\VerySimpleDoc.doc");
ByteArrayOutputStream out = new ByteArrayOutputStream();
documentAspose.getSaveOptions().setHtmlExportEncoding(Charset.forName("UTF-16")); // Not taken into account in Websphere/Unix but used in JBoss/Windows
documentAspose.save(out, SaveFormat.HTML);
String str = out.toString("UTF-16");
str = str.replaceAll("\u2019", "'"); // word curly '
str = str.replaceAll("\u20ac", "Ç"); // euro symbol

Still a workaround but enough for me.
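
For anyone who needs UTF-8 bytes rather than a String, the workaround can be extended roughly as follows. This is a sketch, not something confirmed by Aspose: it assumes the saved bytes are UTF-16 with a byte order mark and that the meta tag contains the literal text charset=utf-16, as in the output posted earlier. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    // Decode UTF-16 HTML bytes, fix the declared charset, and re-encode as UTF-8.
    // Assumption: the input starts with a BOM, so the decoder can pick the byte order.
    static byte[] transcode(byte[] utf16Html) {
        String html = new String(utf16Html, StandardCharsets.UTF_16);
        // Keep the meta declaration consistent with the new byte encoding.
        html = html.replace("charset=utf-16", "charset=utf-8");
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the bytes held by the ByteArrayOutputStream after save().
        byte[] saved = "<meta content=\"text/html; charset=utf-16\"><p>ok</p>"
                .getBytes(StandardCharsets.UTF_16);

        byte[] utf8 = transcode(saved);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Because UTF-8 can represent every character, this route would not need the character replacements used above.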

Thank you a lot for your support.

You might want to take a look at the String.getBytes() and ByteArrayOutputStream.toString() calls in your code: with no encoding specified, they use the platform default.
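
To illustrate what I mean (plain JDK only, nothing here is taken from the Aspose.Words source, and the class and helper names are made up), this small sketch prints the bytes produced for the euro sign with and without an explicit charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitGetBytes {
    // Format bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20ac";

        // Platform dependent: the result changes with the JVM's default charset.
        System.out.println(Charset.defaultCharset() + ": " + hex(euro.getBytes()));

        // Explicit charsets give the same bytes on every platform.
        System.out.println("UTF-8: " + hex(euro.getBytes(StandardCharsets.UTF_8)));       // E2 82 AC
        System.out.println("UTF-16LE: " + hex(euro.getBytes(StandardCharsets.UTF_16LE))); // AC 20
    }
}
```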

Hi

Thank you for the additional information. But I am still unable to reproduce the problem. I tried running your code in Windows and Linux environments. As far as I can see, everything works fine on my side. I use the latest version of Aspose.Words for testing (4.0.2). You can download it from here:
https://releases.aspose.com/words/java
Best regards.