Unicode Symbols are Lost after HTML to PDF Conversion using .NET

Horst · September 24, 2021, 8:11am

Hi!

Unicode characters are not converted correctly when converting HTML to PDF (Aspose.Words 21.9.0).

    var inputFile = new FileInfo(@"Documents/helloUnicode.html");
    using var input = inputFile.OpenRead();

    var fileInfo = Aspose.Words.FileFormatUtil.DetectFileFormat(input);
    using var document = new PdfDocument(input, new PdfHtmlLoadOptions
    {
        InputEncoding = fileInfo.Encoding?.BodyName
    });

    using var output = File.OpenWrite("helloUnicode.pdf");

    // The heavy check mark (U+2714 U+FE0F) and the cross mark (U+274C) are not
    // rendered in the output PDF; an empty box is printed instead.
    document.Save(output);

    // Hint: saving as a TIFF image with new Aspose.Words.Document().Save(output, SaveFormat.Tiff); works!

Best regards

helloUnicode.zip (71.5 KB)

tahir.manzoor · September 24, 2021, 8:51am

@Horst

We could not reproduce this issue with the following simple code example, so please use it to get the desired output. The output PDF is attached to this post for your reference: 21.9.pdf (56.0 KB)

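    // Requires: using System.Text; using Aspose.Words.Shaping.HarfBuzz;
    // MyDir is the folder that contains the input HTML file.
    Aspose.Words.Loading.HtmlLoadOptions htmlLoadOptions = new Aspose.Words.Loading.HtmlLoadOptions();
    htmlLoadOptions.Encoding = Encoding.UTF8;
    Aspose.Words.Document doc = new Aspose.Words.Document(MyDir + "helloUnicode.html", htmlLoadOptions);
    // Use HarfBuzz text shaping when laying out the document.
    doc.LayoutOptions.TextShaperFactory = HarfBuzzTextShaperFactory.Instance;
    doc.Save(MyDir + "21.9.pdf");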

Moreover, please note that Aspose.Words requires TrueType fonts when rendering a document to fixed-page formats (JPEG, PNG, PDF or XPS). The fonts used in your document must be installed on the machine where you convert documents to PDF. Please refer to the following articles:

Using TrueType Fonts
Manipulating and Substitution TrueType Fonts
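
If installing fonts system-wide is not an option, the font requirement above can possibly be met by pointing Aspose.Words at an additional font folder and logging font substitution warnings. The following is only a minimal sketch, not a confirmed fix for this thread: the "fonts" folder, the output file name, and the FontWarningLogger class are hypothetical, and it assumes that folder holds a TrueType font containing glyphs for U+2714 and U+274C.

    using System;
    using System.Text;
    using Aspose.Words;
    using Aspose.Words.Fonts;
    using Aspose.Words.Loading;

    // Hypothetical helper: prints font substitution warnings raised during layout.
    class FontWarningLogger : IWarningCallback
    {
        public void Warning(WarningInfo info)
        {
            if (info.WarningType == WarningType.FontSubstitution)
                Console.WriteLine("Font substitution: " + info.Description);
        }
    }

    class Program
    {
        static void Main()
        {
            var loadOptions = new HtmlLoadOptions { Encoding = Encoding.UTF8 };
            var doc = new Document("helloUnicode.html", loadOptions);

            // Search the system fonts plus an assumed local "fonts" folder (with subfolders).
            var fontSettings = new FontSettings();
            fontSettings.SetFontsSources(new FontSourceBase[]
            {
                new SystemFontSource(),
                new FolderFontSource("fonts", true)
            });
            doc.FontSettings = fontSettings;

            // Set the callback before saving so warnings issued during layout are reported.
            doc.WarningCallback = new FontWarningLogger();

            doc.Save("helloUnicode.pdf");
        }
    }

If the symbols still appear as empty boxes, the logged warnings should at least show which font was requested and which one was substituted for it.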