Hello,
I am using Aspose.Words in VB.Net to convert .docx documents to .htm files. Most formatting and text converts as expected, but bullet points in lists come through as characters in the Private Use Area.
Specifically, at runtime the default Word bullet point appears in the .docx file as Unicode character U+00B7 (“Middle Dot”). After conversion, however, the .htm file contains U+F0B7 instead, which fails to render correctly. The same pattern appears for some other symbols chosen from Word’s bullet point dropdown: the code point is always increased by 0xF000. Manually defining a new bullet point on a per-document basis does preserve its value, but we’d prefer not to rely on users consistently applying this workaround.
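To make the offset concrete, here is a minimal sketch of the arithmetic in plain VB.NET. It does not use any Aspose.Words API; the module and the helper name `StripSymbolOffset` are hypothetical, and it assumes the shift is always exactly &HF000 and confined to U+F020 through U+F0FF, which is what we have observed so far.

```vb
Imports System

Module PuaOffsetSketch
    ' Hypothetical helper, not part of Aspose.Words.
    ' Assumption: symbol-font characters sit exactly &HF000 above their
    ' original code point, inside the range U+F020..U+F0FF.
    Function StripSymbolOffset(ch As Char) As Char
        Dim code As Integer = AscW(ch)
        If code >= &HF020 AndAlso code <= &HF0FF Then
            Return ChrW(code - &HF000)
        End If
        ' Anything outside that range is returned unchanged.
        Return ch
    End Function

    Sub Main()
        ' Shifted bullet as seen in the .htm output: prints 00B7
        Console.WriteLine(AscW(StripSymbolOffset(ChrW(&HF0B7))).ToString("X4"))
        ' Ordinary character is left alone: prints 0041
        Console.WriteLine(AscW(StripSymbolOffset("A"c)).ToString("X4"))
    End Sub
End Module
```

This only illustrates the relationship between the two values (61623 = &HF0B7 = &HB7 + &HF000); it is not a claim about how the converter produces them.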
I have already looked through the forums for similar issues, in which a commonly recommended method is to search the document’s lists for levels that use the “Symbol” font with the expected bullet point character and replace that character with a valid bullet point. But even though the change is logged at runtime, the final converted file still has those symbols 0xF000 higher.
https://forum.aspose.com/t/docx-convert-to-rtf-bullet-and-numbering-missing/244029/5?u=modern.gov
This has been observed on Aspose.Words versions 23.12.0.0 and 24.11.0.0. I have attached before and after files, and the code block we use to handle this conversion is below.
ConversionDocxHtm.zip (12.5 KB)
Dim oAsposeDocument As New Aspose.Words.Document(sSrcFilePath)
Dim oSaveOptions As New Aspose.Words.Saving.HtmlSaveOptions
Dim oParameters As Aspose.Words.Saving.SaveOutputParameters
oSaveOptions.ExportImagesAsBase64 = True
'oSaveOptions.ExportFontsAsBase64 = True ' #182630: Doesn't work
oSaveOptions.SaveFormat = SaveFormat.Html
'oSaveOptions.ExportXhtmlTransitional = True ' #182630: Doesn't work
oSaveOptions.ExportRoundtripInformation = False ' #182630: Remove Aspose noise from the .htm files for the purposes of converting back (as we don't need it)
'oSaveOptions.CssStyleSheetType = Saving.CssStyleSheetType.Inline ' #182630: Must be Inline, as other options are referenced/set in the <head> which is stripped out
oSaveOptions.Encoding = System.Text.Encoding.UTF8 '#159988 Force UTF8 encoding
oSaveOptions.ExportListLabels = Saving.ExportListLabels.AsInlineText ' #177129: Force any list formatting into <p> to maintain format once the <head> is stripped
For Each oList As Lists.List In oAsposeDocument.Lists
    For Each oLevel As Lists.ListLevel In oList.ListLevels
        ' AscW throws on an empty string, so skip levels without a number format
        If String.IsNullOrEmpty(oLevel.NumberFormat) Then Continue For
        Trace("Font name = " & oLevel.Font.Name & " NumberFormat = " & AscW(oLevel.NumberFormat).ToString()) ' "61623" = Word default bullet point + F000 as an int
        If oLevel.Font.Name.Equals("Symbol") AndAlso AscW(oLevel.NumberFormat) = 61623 Then
            oLevel.NumberFormat = "·" ' #182630: Sets correctly as logging below shows (other strings/hex for bullet points have worked too), but still has F000 added to it by the time the conversion is completed
            Trace("Updated_NumberFormat = " & AscW(oLevel.NumberFormat).ToString())
        End If
    Next
Next
oParameters = oAsposeDocument.Save(sDisplayFilePath, oSaveOptions)
Thanks,
Beau