Free Support Forum - aspose.com

Different bullet types are translated differently from RTF to HTML depending on character set

I am translating RTF documents to HTML with Aspose.Words for Java.

One RTF document uses the Unicode character U+2022 (decimal 8226) to show a bullet. In RTF it is encoded as \u8226. With the windows-1252 encoding, Aspose translates this correctly into HTML. With other character sets it shows up as double quotes (").

The other bullet is paragraph formatting in RTF using \pnlvlblt. With the ISO-8859-1 encoding, Aspose translates this correctly into HTML. With other character sets it shows up as a question mark (?).
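As a side check that does not involve Aspose at all, the snippet below asks the JDK which of the character sets mentioned in this thread can represent the bullet character U+2022. It is only a sketch: the class and method names (BulletCharsetCheck, canEncode) are made up for illustration, and it assumes the JDK in use ships the windows-1252 charset, which mainstream JDKs do. It says nothing about how Aspose.Words maps \pnlvlblt bullets internally.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Aspose.Words.
final class BulletCharsetCheck {
    // U+2022 BULLET, decimal 8226 (the value used in the RTF u8226 control word).
    static final char BULLET = '\u2022';

    // Returns true if the named charset has a byte sequence for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "windows-1252", "ISO-8859-1"}) {
            System.out.println(name + " can encode U+2022: " + canEncode(name, BULLET));
        }
    }
}
```

UTF-8 and windows-1252 both have an encoding for U+2022, while ISO-8859-1 does not, so a writer targeting ISO-8859-1 has to substitute something else for that character or emit a character reference.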

Attached is a document that has both types of bullets in it. Typically a document will only have one or the other type of bullets.

Here is the code I am using for translation.

Document rtfdocument = new Document(in);
HtmlSaveOptions hso = new HtmlSaveOptions(SaveFormat.HTML);
hso.setPrettyFormat(true);
hso.setAllowNegativeIndent(true);
hso.setCssStyleSheetType( CssStyleSheetType.INLINE); //  .EMBEDDED);
hso.setEncoding(Charset.forName("ISO-8859-1"));  // note: windows-1252 is not a subset of ISO-8859-1; it adds printable characters (including the bullet) in 0x80-0x9F
hso.setExportHeadersFootersMode(ExportHeadersFootersMode.NONE);
hso.setExportImagesAsBase64(true);
rtfdocument.save(htmlout, hso);

Is there anything I can do to make Aspose handle both bullet types with just one character set?

Thanks,

Jared

Hi Jared,


Thanks for your inquiry. If you are using an older version of Aspose.Words, please upgrade to the latest version (v13.2.0) from here:
http://www.aspose.com/community/files/51/.net-components/aspose.words-for-.net/entry448722.aspx

In your case, I suggest you use the default encoding. HtmlSaveOptions uses UTF-8 as its default encoding, and this should serve your purpose. Please let us know if you have any more queries.

Document rtfdocument = new Document(MyDir + "both_bullets.rtf");

HtmlSaveOptions hso = new HtmlSaveOptions(SaveFormat.HTML);

hso.setPrettyFormat(true);

hso.setAllowNegativeIndent(true);

hso.setCssStyleSheetType(CssStyleSheetType.INLINE); // .EMBEDDED);

// hso.setEncoding(Charset.forName("UTF-8")); // not needed: UTF-8 is the default encoding

hso.setExportHeadersFootersMode(ExportHeadersFootersMode.NONE);

hso.setExportImagesAsBase64(true);

rtfdocument.save(MyDir + "out-java.html", hso);


Thank you so much for your quick reply and answer. Using the new Aspose.Words and using UTF-8 worked great. Now I’m having trouble with a third bullet type in the attached document. I need to do RTF to HTML, but the bullets are showing up as question marks. The last fix worked for the previous document, attached to my first post, with the windows-1252 character set, which would be ideal for us if it also worked for this third bullet type. We are streaming it to a system that does not handle UTF-8. That said, if you have a way to get this third bullet type working with either character set, we can do some filtering in the stream before sending it to our external process.
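For the stream filtering mentioned above, one possible approach is to let Aspose write UTF-8, then replace every character the target character set cannot represent with an HTML numeric character reference before re-encoding. The following is a minimal sketch of that idea under stated assumptions: HtmlCharRefFilter and escapeUnencodable are hypothetical names, not Aspose API, and the input is assumed to be the HTML already decoded into a Java String.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical post-processing helper, not part of Aspose.Words.
final class HtmlCharRefFilter {
    // Replaces each code point the target charset cannot encode with a
    // decimal numeric character reference such as &#8226;.
    static String escapeUnencodable(String html, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(html.length());
        for (int i = 0; i < html.length(); ) {
            int codePoint = html.codePointAt(i);
            int length = Character.charCount(codePoint); // 2 for surrogate pairs
            CharSequence piece = html.subSequence(i, i + length);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

The result can then be written with the target charset, for example filtered.getBytes(Charset.forName("windows-1252")). Any charset declaration inside the HTML itself would still need to be adjusted to match, and this does not help if the bullet has already been turned into a literal question mark before the filter runs.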


Thanks,

Jared

Hi Jared,


Thanks for your inquiry. I have tested the scenario and could not reproduce the issue with the latest version of Aspose.Words for Java. Please upgrade to the latest version (v13.3.0) and let us know how it goes on your side.

If the problem persists, please share the browser (and its version) in which you are checking the output HTML. I will investigate the issue on my side and provide you with more information.