Bullet points in Private Use Area for Docx to Htm conversion

Hello,

I am using Aspose.Words in VB.Net to convert .docx documents to .htm files. Most formatting and text come through as expected; however, bullet points in lists come through as characters in the Unicode Private Use Area.

Specifically, the default Word bullet points appear as Unicode character U+00B7 (“Middle Dot”) in the .docx file at runtime. Once converted, however, the .htm file has them as U+F0B7, which fails to render correctly. A similar pattern appears for some other symbols from Word’s bullet point dropdown: each code point is increased by 0xF000. Manually defining a new bullet point on a per-document basis does preserve its value, but we’d prefer not to rely on users consistently applying this workaround.

I have already looked through the forums for similar issues, where a commonly recommended method is to search the document’s lists for the “Symbol” font with the expected bullet point character and replace it with a valid bullet point. But even though the change is logged at runtime, the final converted file still has those symbols 0xF000 higher.

https://forum.aspose.com/t/docx-convert-to-rtf-bullet-and-numbering-missing/244029/5?u=modern.gov

This has been observed on Aspose.Words versions 23.12.0.0 and 24.11.0.0. I have attached before and after files; the code we use for this conversion is below.

ConversionDocxHtm.zip (12.5 KB)

Dim oAsposeDocument As New Aspose.Words.Document(sSrcFilePath)
Dim oSaveOptions As New Aspose.Words.Saving.HtmlSaveOptions
Dim oParameters As Aspose.Words.Saving.SaveOutputParameters
oSaveOptions.ExportImagesAsBase64 = True
'oSaveOptions.ExportFontsAsBase64 = True ' #182630: Doesn't work
oSaveOptions.SaveFormat = SaveFormat.Html
'oSaveOptions.ExportXhtmlTransitional = True ' #182630: Doesn't work
oSaveOptions.ExportRoundtripInformation = False ' #182630: Remove Aspose noise from the .htm files for the purposes of converting back (as we don't need it)
'oSaveOptions.CssStyleSheetType = Saving.CssStyleSheetType.Inline ' #182630: Must be Inline, as other options are referenced/set in the <head> which is stripped out
oSaveOptions.Encoding = System.Text.Encoding.UTF8 '#159988 Force UTF8 encoding
oSaveOptions.ExportListLabels = Saving.ExportListLabels.AsInlineText ' #177129: Force any list formatting into <p> to maintain format once the <head> is stripped

For Each oList As Lists.List In oAsposeDocument.Lists
    For Each oLevel As Lists.ListLevel In oList.ListLevels
        Trace("Font name = " & oLevel.Font.Name & " NumberFormat = " & AscW(oLevel.NumberFormat).ToString()) ' "61623" = Word default bullet point + F000 as an int
        If oLevel.Font.Name.Equals("Symbol") And AscW(oLevel.NumberFormat) = 61623 Then
            oLevel.NumberFormat = "·" ' #182630: Sets correctly as logging below shows (other strings/hex for bullet points have worked too), but still has F000 added to it by the time the conversion is completed
            Trace("Updated_NumberFormat = " & AscW(oLevel.NumberFormat).ToString())
        End If
    Next
Next

oParameters = oAsposeDocument.Save(sDisplayFilePath, oSaveOptions)

Thanks,
Beau

@modern.gov Most likely you also need to change the font:

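using Aspose.Words;
using Aspose.Words.Lists;
using Aspose.Words.Saving;

Document doc = new Document(@"C:\Temp\in.docx");

foreach (Aspose.Words.Lists.List lst in doc.Lists)
{
    foreach (ListLevel lvl in lst.ListLevels)
    {
        // Word's default bullet: the "Symbol" font with the PUA character U+F0B7.
        if (lvl.Font.Name == "Symbol" && lvl.NumberFormat == "\uF0B7")
        {
            // Change the font as well as the character.
            lvl.Font.Name = "Segoe UI";
            lvl.NumberFormat = "\u2022"; // U+2022 BULLET
        }
    }
}

HtmlSaveOptions opt = new HtmlSaveOptions();
opt.PrettyFormat = true;
// Write list labels as inline text rather than HTML list markup.
opt.ExportListLabels = ExportListLabels.AsInlineText;
doc.Save(@"C:\Temp\out.html", opt);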

out.zip (657 Bytes)
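
For the VB.Net code in the original question, the same fix might look like the sketch below. It is a direct translation of the C# snippet above and uses only the Aspose.Words members already shown in this thread (`Font.Name`, `NumberFormat`, `ExportListLabels`). The module name and file paths are placeholders, and "Segoe UI" is just one example of a non-symbol font.

```vb
Imports Aspose.Words
Imports Aspose.Words.Saving

Module BulletFixSketch ' hypothetical module name, for illustration only
    Sub Main()
        ' Placeholder paths; substitute your own source and target files.
        Dim doc As New Document("C:\Temp\in.docx")

        For Each lst As Aspose.Words.Lists.List In doc.Lists
            For Each lvl As Aspose.Words.Lists.ListLevel In lst.ListLevels
                ' Word's default bullet: "Symbol" font with the PUA character U+F0B7.
                If lvl.Font.Name = "Symbol" AndAlso lvl.NumberFormat = ChrW(&HF0B7) Then
                    ' Change the font as well as the character.
                    lvl.Font.Name = "Segoe UI"
                    lvl.NumberFormat = ChrW(&H2022) ' U+2022 BULLET
                End If
            Next
        Next

        Dim opt As New HtmlSaveOptions()
        opt.ExportListLabels = ExportListLabels.AsInlineText
        doc.Save("C:\Temp\out.html", opt)
    End Sub
End Module
```

Comparing the whole `NumberFormat` string, rather than calling `AscW` on it, also avoids an exception on list levels whose number format is empty.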

Perfect, that works just as needed. Thank you. Out of curiosity, why would the font need to be changed for this specifically?

@modern.gov Different fonts have different sets of supported characters. The Windows “Symbol” font is a symbolic font (like “Webdings”, “Wingdings”, etc.) that uses the Unicode PUA (Private Use Area), so the font has to be changed together with the character.
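
To illustrate the 0xF000 offset seen in the question, here is a small self-contained VB.Net sketch that uses no Aspose calls. It assumes the usual convention for symbol-encoded Windows fonts, where the font's own codes 0x20 to 0xFF are exposed at U+F020 to U+F0FF. The helper names `IsSymbolPua` and `NativeCode` are made up for this example.

```vb
Imports System

Module SymbolPuaDemo ' hypothetical helper module, for illustration only
    ' True when the character lies in U+F020..U+F0FF, the range assumed here
    ' for symbol-encoded fonts such as "Symbol" or "Wingdings".
    Function IsSymbolPua(ch As Char) As Boolean
        Dim code As Integer = AscW(ch)
        Return code >= &HF020 AndAlso code <= &HF0FF
    End Function

    ' Removes the &HF000 offset to recover the font's own character code.
    Function NativeCode(ch As Char) As Integer
        If IsSymbolPua(ch) Then Return AscW(ch) - &HF000
        Return AscW(ch)
    End Function

    Sub Main()
        Dim label As Char = ChrW(&HF0B7) ' the bullet reported in the question (61623)
        Console.WriteLine("U+" & AscW(label).ToString("X4") &
                          " -> &H" & NativeCode(label).ToString("X2"))
    End Sub
End Module
```

For the default Word bullet this maps U+F0B7 back to 0xB7. That value is only meaningful inside the “Symbol” font, which is why a replacement character such as U+2022 needs a regular text font.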