Invalid Unicode character shown when converting HTML to Word (but not PDF)

Hi there,

We have a problem when converting an HTML file to Word; please see the attached files. Converting the same HTML file to PDF, however, appears to work fine.

Code used for converting to docx:

com.aspose.words.Document input = new com.aspose.words.Document(inputStream);
input.save(os, SaveFormat.DOCX);

Code used for converting to pdf:

com.aspose.words.Document input = new com.aspose.words.Document(inputStream);
PdfSaveOptions options = new PdfSaveOptions();
input.save(os, options);

Thanks,
Tien

Hi Tien,

Thanks for your inquiry. I am afraid I could not see any issue with your input/output documents. Could you please clarify where the issue is? Please see the attached screenshot: your HTML, Word and PDF documents all look the same.

Best regards,

Hi there,

On all of our machines, and on our clients’ machines, the Word document appears as in the attached screenshot.

Cheers,
Tien

Hi Tien,

Thanks for your inquiry. Please configure your Word application to always open this file in Unicode (UTF-8) encoding mode. I hope this helps.

Best regards,

Hi Awais,

Could you please confirm that you opened the attached DOCX in your MS Word and that everything appears OK (unlike the screenshot attached above)?

The screenshot shows the DOCX file (generated by Aspose.Words) opened in MS Word 2010. You can tell from the [Compatibility Mode] label on the title bar, though I have no idea why Word shows that. The file was generated by Aspose.Words and has not been touched since.

If I misunderstood you, please clarify how I should open the file in the correct encoding mode so that all characters display correctly.

Thanks,
Tien

Hi Awais,

My bad. I’ve just noticed that you opened the DOCX in Word 2013. However, as my screenshot shows, it doesn’t appear correctly in Word 2010. Could you please describe the solution in more detail?

Thanks,
Tien

Hi Tien,

Thanks for your inquiry. It seems to be a problem with Aspose.Words generated output DOCX. I have logged this issue in our bug tracking system. The ID of this issue is WORDSNET-10430. Your thread has also been linked to this issue and you will be notified as soon as it is resolved. Sorry for the inconvenience.

Secondly, yes, MS Word 2010 does not display the first two characters correctly, but MS Word 2013 has no such problem. As a workaround, you may first save your document to RTF format and then re-save it to DOCX format using Aspose.Words. MS Word 2010 will then open the final DOCX correctly.
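
A rough sketch of that round trip in Java, kept in memory so no temporary file is needed. The names RoundTrip, Conversion and viaIntermediate are hypothetical helpers made up for illustration, not part of Aspose.Words; the Aspose calls appear only in comments and assume the same Document constructor and save method used earlier in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

final class RoundTrip {

    /** Hypothetical hook: one "load from a stream, save to a stream" step. */
    @FunctionalInterface
    interface Conversion {
        void convert(InputStream in, OutputStream out) throws Exception;
    }

    /**
     * Runs the first conversion into an in-memory buffer, then feeds that
     * buffer into the second conversion, which writes the final output.
     */
    static void viaIntermediate(InputStream in, OutputStream out,
                                Conversion first, Conversion second) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        first.convert(in, buffer);
        // A fresh input stream over the buffered bytes starts reading at offset 0.
        second.convert(new ByteArrayInputStream(buffer.toByteArray()), out);
    }

    // Assumed usage with Aspose.Words for Java (not compiled here):
    //   RoundTrip.viaIntermediate(htmlStream, os,
    //       (i, o) -> new com.aspose.words.Document(i).save(o, SaveFormat.RTF),
    //       (i, o) -> new com.aspose.words.Document(i).save(o, SaveFormat.DOCX));
}
```

Any document manipulation would presumably go inside the first step, before the save to RTF, so that the final DOCX is written from the re-loaded document.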

Best regards,

Thanks Awais,

I can confirm that the issue doesn’t appear in the RTF output. However, we cannot use the proposed workaround, as we perform some manipulations on the document before outputting to DOCX.

I also noticed that the issue you raised, WORDSNET-10430, is for .NET rather than Java, which is where we have the problem. Please verify.

Cheers,
Tien

Hi Tien,

Thanks for your inquiry. Please note that the latest version of Aspose.Words for Java is completely auto-ported from .NET, i.e. we do not write code for Aspose.Words for Java; it is generated automatically from the C# code of Aspose.Words for .NET. In your case, the issue, which was logged with the WORDSNET prefix, will be resolved automatically for the Java variant of Aspose.Words as well. Your problem will be fixed as soon as the linked issue is resolved.

Best regards,

Hi Awais,
The bug raised above is not yet fixed, and we have now hit the same issue with other special characters. This time the problem appears in the PDF output as well as in the generated Word document. Please have a look at the attached documents.
Also, the proposed workaround of saving to RTF doesn't work in this case.

Please provide any suggestions.

Thanks
Ashwini

Hi Ashwini,

Thanks for your inquiry. The problem can be observed even when you view worddocument.docx in Microsoft Word 2013 (please see the attached screenshot). However, you may use the following code to fix this issue:

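// Load the DOCX, round-trip it through RTF in memory, then save as PDF.
Document doc = new Document(MyDir + "worddocument.docx");
MemoryStream rtfStream = new MemoryStream();
doc.Save(rtfStream, SaveFormat.Rtf);
rtfStream.Position = 0; // rewind so loading starts at the beginning of the stream
Document docx = new Document(rtfStream);
docx.Save(MyDir + "out.pdf");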

Best regards,

The issues you have found earlier (filed as WORDSNET-10430) have been fixed in this .NET update and this Java update.

