Bullet points in Private Use Area for Docx to Htm conversion

Hello,

I am using Aspose.Words in VB.Net to convert .docx documents to .htm files. Most formatting and text work as expected; however, bullet points in lists come through as characters in the Private Use Area.

Specifically, the default Word bullet point appears to be Unicode character U+00B7 (“Middle Dot”) in the .docx file at runtime. Once converted, however, the .htm file contains U+F0B7, which then fails to render correctly. A similar pattern appears for some other symbols chosen via Word’s bullet point dropdown: their value is always increased by 0xF000. Manually defining a new bullet point on a per-document basis does preserve its value, but we’d prefer not to rely on users consistently applying this workaround.

I have already looked through the forums for similar issues, where a commonly recommended method is to search the document’s lists for the “Symbol” font with the expected bullet point character and replace it with a valid bullet point. But even though the change is logged at runtime, the converted file still has those symbols 0xF000 higher.

https://forum.aspose.com/t/docx-convert-to-rtf-bullet-and-numbering-missing/244029/5?u=modern.gov

This has been observed on Aspose.Words versions 23.12.0.0 & 24.11.0.0. I have attached a before & after file, along with the code block we have to handle this conversion below.

ConversionDocxHtm.zip (12.5 KB)

Dim oAsposeDocument As New Aspose.Words.Document(sSrcFilePath)
Dim oSaveOptions As New Aspose.Words.Saving.HtmlSaveOptions
Dim oParameters As Aspose.Words.Saving.SaveOutputParameters
oSaveOptions.ExportImagesAsBase64 = True
'oSaveOptions.ExportFontsAsBase64 = True ' #182630: Doesn't work
oSaveOptions.SaveFormat = SaveFormat.Html
'oSaveOptions.ExportXhtmlTransitional = True ' #182630: Doesn't work
oSaveOptions.ExportRoundtripInformation = False ' #182630: Remove Aspose noise from the .htm files for the purposes of converting back (as we don't need it)
'oSaveOptions.CssStyleSheetType = Saving.CssStyleSheetType.Inline ' #182630: Must be Inline, as other options are referenced/set in the <head> which is stripped out
oSaveOptions.Encoding = System.Text.Encoding.UTF8 '#159988 Force UTF8 encoding
oSaveOptions.ExportListLabels = Saving.ExportListLabels.AsInlineText ' #177129: Force any list formatting into <p> to maintain format once the <head> is stripped

For Each oList As Lists.List In oAsposeDocument.Lists
    For Each oLevel As Lists.ListLevel In oList.ListLevels
        Trace("Font name = " & oLevel.Font.Name & " NumberFormat = " & AscW(oLevel.NumberFormat).ToString()) ' "61623" = Word default bullet point + F000 as an int
        If oLevel.Font.Name.Equals("Symbol") AndAlso AscW(oLevel.NumberFormat) = 61623 Then
            oLevel.NumberFormat = "·" ' #182630: Sets correctly as logging below shows (other strings/hex for bullet points have worked too), but still has F000 added to it by the time the conversion is completed
            Trace("Updated_NumberFormat = " & AscW(oLevel.NumberFormat).ToString())
        End If
    Next
Next

oParameters = oAsposeDocument.Save(sDisplayFilePath, oSaveOptions)

Thanks,
Beau

@modern.gov Most likely you also need to change the font:

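using Aspose.Words;
using Aspose.Words.Lists;
using Aspose.Words.Saving;

Document doc = new Document(@"C:\Temp\in.docx");

foreach (Aspose.Words.Lists.List lst in doc.Lists)
{
    foreach (ListLevel lvl in lst.ListLevels)
    {
        // U+F0B7 is the default Word bullet as stored for the "Symbol" font.
        if (lvl.Font.Name == "Symbol" && lvl.NumberFormat == "\xF0B7")
        {
            // Switch to a regular Unicode font and use the standard bullet (U+2022).
            lvl.Font.Name = "Segoe UI";
            lvl.NumberFormat = "\x2022";
        }
    }
}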

HtmlSaveOptions opt = new HtmlSaveOptions();
opt.PrettyFormat = true;
opt.ExportListLabels = ExportListLabels.AsInlineText;
doc.Save(@"C:\Temp\out.html", opt);

out.zip (657 Bytes)

Perfect, that works just as needed. Thank you. Out of curiosity, why would the font need to be changed for this specifically?

@modern.gov Different fonts support different sets of characters. The Windows “Symbol” font is a symbolic font (like “Webdings”, “Wingdings”, etc.) whose glyphs are addressed through the Unicode PUA.
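
To make the offset concrete: characters of symbolic fonts are typically stored in the range U+F020 to U+F0FF, that is, the font's own character code plus 0xF000. Below is a minimal C# sketch that detects such characters and recovers the underlying code. The class and method names are hypothetical and not part of Aspose.Words.

```csharp
using System;

// Hypothetical helper, not an Aspose.Words API.
public static class SymbolPua
{
    // True when the character lies in the PUA block used by symbolic fonts
    // such as "Symbol" or "Wingdings" (U+F020..U+F0FF).
    public static bool IsSymbolPua(char c) => c >= '\uF020' && c <= '\uF0FF';

    // Removes the 0xF000 offset to get the font-specific character code;
    // characters outside the range are returned unchanged.
    public static int ToFontCode(char c) => IsSymbolPua(c) ? c - 0xF000 : c;
}
```

For the default Word bullet, the logged value 61623 is 0xF0B7, and removing the offset gives 0xB7. That code only renders as a bullet when the “Symbol” font is applied, which is why replacing the character alone is not enough.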

Thanks again, as this works well for the default Word bullet point. What would be your recommendation to handle non-default bullet points such as those set by users within Word itself?

As our users can theoretically pick any symbol or letter for a bullet point, would this be doable in a dynamic manner (rather than attempting to see what the bullet point is and setting it on a per character basis)?

@modern.gov The ideal solution is to use the same fonts for rendering as those used by MS Word. In this case the document will be rendered the same as it looks in MS Word.
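
Where matching fonts cannot be guaranteed on the viewing side, one possible alternative is a lookup table keyed by font name and character code, so that new bullets are handled by adding a table entry instead of a new code branch. The sketch below is an assumption-laden illustration only. `BulletMap` and `TryMap` are hypothetical names, and the Wingdings entries are assumed mappings that should be verified against the fonts actually in use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an Aspose.Words API.
public static class BulletMap
{
    // (font name, font-specific code) -> Unicode replacement.
    private static readonly Dictionary<(string Font, int Code), string> Table =
        new Dictionary<(string Font, int Code), string>
        {
            { ("Symbol", 0xB7), "\u2022" },    // bullet, as in the earlier reply
            { ("Wingdings", 0xA7), "\u25AA" }, // assumed: small black square
            { ("Wingdings", 0xFC), "\u2713" }, // assumed: check mark
        };

    // Returns true and the replacement string when the single-character
    // label is a known symbolic-font bullet; false otherwise.
    public static bool TryMap(string fontName, string numberFormat, out string replacement)
    {
        replacement = null;
        if (string.IsNullOrEmpty(numberFormat) || numberFormat.Length != 1)
            return false;

        int code = numberFormat[0];
        if (code >= 0xF020 && code <= 0xF0FF)
            code -= 0xF000; // strip the symbolic-font PUA offset

        return Table.TryGetValue((fontName, code), out replacement);
    }
}
```

Inside the list-level loop from the earlier reply, this could be called as `BulletMap.TryMap(lvl.Font.Name, lvl.NumberFormat, out string mapped)`. On success, `lvl.NumberFormat` would be set to `mapped` and `lvl.Font.Name` to a font that contains the replacement glyph. Levels with no table entry are left untouched, so they still depend on the original font being available.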

So is it a requirement to capture every bullet point in the document during conversion and have code in place to handle each possible combination of character and font individually?

So far we have had one bullet point (the Word default) that we need to convert from ‘Symbol’ to ‘Segoe UI’ while also changing its Unicode value, but for other bullet points a user may choose, might we instead need to keep the font?

As an example we have a document for testing (see AgendaItem.htm in attached .zip below) which has a ‘Wingdings’ bullet point for “Bullet 3”. Are we required to have code capture that explicit example, change it from the Private Use Area but maintain the font? Do we need to do the same for every possibility?

Viewed as is, the document renders correctly, but on our site it appears as below. Bullets 1 & 4 in the top section use the functionality you previously helped with; the others render incorrectly:

Attached are the original .docx and the .htm after conversion. The only code applied to these files is the handling for the “Word default” bullet points.

DocxHtmFormatting.zip (8.3 KB)

@modern.gov Browsers use the fonts available in the end user's environment. So if the required font is not available on the end user's side and the browser cannot find an alternative, the character might not be displayed properly. This behavior is outside Aspose.Words' control.
As an option, you can export font resources to HTML by setting HtmlSaveOptions.ExportFontResources. Please note that in this case the required fonts must be available in the environment where the document is converted:

Document doc = new Document(@"C:\temp\in.docx");

HtmlSaveOptions opt = new HtmlSaveOptions();
opt.ExportFontResources = true;
opt.ExportFontsAsBase64 = true;
opt.PrettyFormat = true;

doc.Save(@"C:\Temp\out.html", opt);

out.zip (38.9 KB)