Word to PDF Conversion Encoding Error with Certain Unicode Characters

gordonca · February 12, 2019, 3:12pm

We are having an issue where certain Unicode characters do not render correctly when a Word document is saved to PDF. All of the affected characters seem to be from the Latin Extended-B Unicode block.
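To narrow down which characters are affected, a small helper like the following may be useful. It is only a sketch: the function name and the sample string are made up for illustration, and it assumes the Latin Extended-B block spans U+0180 to U+024F, as defined by the Unicode standard.

```python
# Latin Extended-B covers code points U+0180 through U+024F (inclusive).
LATIN_EXTENDED_B = range(0x0180, 0x0250)


def latin_extended_b_chars(text):
    """Return the distinct Latin Extended-B characters in `text`,
    in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) in LATIN_EXTENDED_B and ch not in seen:
            seen.append(ch)
    return seen


# Hypothetical sample text: U+0218 and U+01CD are in Latin Extended-B,
# U+00E9 (e with acute) is in Latin-1 Supplement and is not reported.
sample = "\u0218tefan \u01cd caf\u00e9"
found = latin_extended_b_chars(sample)
print([f"U+{ord(ch):04X}" for ch in found])  # ['U+0218', 'U+01CD']
```

Running this over the text extracted from input.docx would give a concrete list of code points to check against the fonts installed on the server.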

I have verified this is not a font issue by turning off font embedding altogether when saving to PDF. When I open the PDF on my laptop, which has Arial installed, the characters appear as squares. When I open the Word document, the characters are rendered correctly in Arial.

See attached zipfile with:

input.docx
output.pdf
actual.pdf

example documents.zip (413.3 KB)

Here is our Aspose code (written in python with jpype):

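import jpype

# Assumes the JVM has already been started with the Aspose.Words jar
# on the classpath before this function is called.
def convert_to_pdf(src_path, dest_path):
    Document = jpype.JClass('com.aspose.words.Document')
    SaveFormat = jpype.JClass('com.aspose.words.SaveFormat')
    Color = jpype.JClass('java.awt.Color')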

    doc = Document(src_path)
    # Fix Pink Background Issue
    # - Right now we are going to always set background color to white.
    # - If that causes problems then we could target specific colors
    #   that are causing problems such as:
    #       pink = Color(255, 153, 204)

    white = Color(255, 255, 255)
    doc.setPageColor(white)

    PdfSaveOptions = jpype.JClass('com.aspose.words.PdfSaveOptions')
    options = PdfSaveOptions()
    options.setEmbedFullFonts(True)

    doc.save(dest_path, options)

Server is on Ubuntu 16.

awais.hafeez · February 13, 2019, 3:07am

@gordonca,

Please also ZIP and attach your input.docx file here for testing. We will then investigate the issue on our end and provide you more information.

gordonca · February 13, 2019, 3:19am

@awais.hafeez Sorry forgot to add it! Attached is a zip with all three documents.

example documents (v2).zip (422.6 KB)

awais.hafeez · February 13, 2019, 9:19am

@gordonca,

For this particular document, please make sure the following fonts are installed on the machine where you are performing the Word to PDF conversion:

Arial
Times New Roman

We have these two fonts installed on our end and see no square boxes during Word to PDF conversion. Please see awjava-19.1.pdf (36.4 KB)

You can also try using the “Arial Unicode” font, which contains glyphs for most languages. Hope this helps.

gordonca · February 13, 2019, 12:49pm

@awais.hafeez

What OS did you run your test on? I do have those fonts installed: as you can see in actual.pdf, other glyphs from those fonts are displayed correctly.

I will try to copy over another version of those fonts and see what happens.

gordonca · February 13, 2019, 4:13pm

OK, we have determined that the versions of Arial and Times New Roman on Ubuntu are not equivalent to the versions of those fonts that ship with Windows. I am still puzzled as to why we still saw the square boxes when font embedding was disabled.

awais.hafeez · February 13, 2019, 4:25pm

@gordonca,

We tested on a Windows 10 machine. Yes, you can copy those fonts from a Windows machine to your Ubuntu server to fix this issue. Also, please refer to the following article:
True Type Fonts
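For anyone following along with jpype, here is a minimal sketch of pointing Aspose.Words at a folder holding the fonts copied from Windows. It assumes the `FontSettings.getDefaultInstance()` and `setFontsFolder(String, boolean)` methods of Aspose.Words for Java, so please check them against the API reference for your installed version. The helper name and the folder path are hypothetical, and `jclass` is expected to be `jpype.JClass` (it is passed in so the helper itself does not import jpype).

```python
def use_fonts_folder(jclass, folder, recursive=True):
    """Tell Aspose.Words to look for TrueType fonts in `folder`.

    `jclass` should be jpype.JClass, called after the JVM has been
    started with the Aspose.Words jar on the classpath.
    """
    # Assumed Aspose.Words for Java API; verify for your version.
    FontSettings = jclass('com.aspose.words.FontSettings')
    settings = FontSettings.getDefaultInstance()
    # Note: this replaces the default font sources with the one folder.
    settings.setFontsFolder(folder, recursive)
    return settings


# Usage (hypothetical path), before calling convert_to_pdf:
#   import jpype
#   use_fonts_folder(jpype.JClass, '/opt/fonts/windows')
```

Because `setFontsFolder` replaces the default sources rather than adding to them, the folder should contain every font the documents need, not only Arial and Times New Roman.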