HTML to PDF conversion - missing Chinese characters

Mariusz_Pala · June 5, 2017, 7:17am

Hi,

We’re trying to convert the attached HTML file to PDF (sample output attached), but some Chinese characters are missing from the resulting PDF, although they are displayed correctly in the HTML source file.

Can you tell us what needs to be done to correct it?

The code we use is:

HtmlLoadOptions options = new HtmlLoadOptions();
options.setInputEncoding("UTF-8");
Document doc = new com.aspose.pdf.Document(htmlFile, options);
doc.save(pdfFile);

Thanks,

Mariusz

Mariusz_Pala · June 5, 2017, 7:20am

Screenshot attached.

asad.ali · June 5, 2017, 1:37pm

Hi Mariusz,

Thanks for contacting support.

I have tested the scenario with the following code snippet using Aspose.Pdf for Java 17.4 and was unable to reproduce the issue. The output file was generated correctly, with no missing Chinese characters.

HtmlLoadOptions htmloptions = new HtmlLoadOptions(dataDir);
htmloptions.setInputEncoding("UTF-8");
Document pdf = new Document(dataDir + "CN-B-2016-00093.htm", htmloptions);
pdf.save(dataDir + "CN-B-2016-00093_out.pdf");

For your reference, I have also attached the output generated by the above code. We would appreciate it if you could share some more information (i.e. API version, OS version, JDK version, etc.), so that we can test the scenario again in our environment and address it accordingly.

Best Regards,

Mariusz_Pala · June 6, 2017, 2:21am

We’re using Aspose.PDF for Java 17.4, running on Tomcat with JDK 1.7 on Windows Server 2012. It happens every time.

Mariusz_Pala · June 6, 2017, 2:52am

I tried running the same code on Mac OS/JDK 1.6 and the output PDF is correct. What could be the reason it doesn’t work correctly on Windows? We observe the same issue in our CentOS environment.

asad.ali · June 6, 2017, 12:45pm

Hi Mariusz,

Thanks for adding more details to the scenario.

I have tested the scenario again on a CentOS 7 x64 system and was able to reproduce the missing Chinese characters issue. Since it appeared that the fonts required to display Chinese characters were not installed in the OS, I tried installing Arial Unicode MS and Times New Roman, but this did not improve the resulting file.
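When characters go missing on one machine but not another, it can help to first check which fonts on that machine cover the text at all. Below is a minimal sketch using only the JDK; the class and method names (CjkFontCheck, containsHan, familiesCovering) are hypothetical helpers and not part of Aspose.PDF. Note that it reports the fonts the JVM can see, which is an assumption-laden proxy: the font lookup performed by the PDF library may differ.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (JDK only, not an Aspose.PDF API).
class CjkFontCheck {

    // Returns true if the text contains at least one Han (Chinese ideograph) code point.
    static boolean containsHan(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                return true;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return false;
    }

    // Returns the families of JVM-visible fonts that can display every character of the sample.
    static List<String> familiesCovering(String sample) {
        List<String> families = new ArrayList<String>();
        Font[] fonts = GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts();
        for (Font font : fonts) {
            String family = font.getFamily();
            // canDisplayUpTo returns -1 when the whole string is displayable
            if (font.canDisplayUpTo(sample) == -1 && !families.contains(family)) {
                families.add(family);
            }
        }
        return families;
    }
}
```

If containsHan(...) is true for the HTML text while familiesCovering(...) only returns the JVM's logical fonts or nothing at all, a missing CJK-capable font on that machine is a likely suspect.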

I have logged the issue as PDFJAVA-36805 in our issue tracking system. We will investigate it in detail and keep you informed about the status of its correction. Please be patient and allow us a little time. As for the testing on Windows Server, we are checking the details on our end and will share our findings with you soon.

We are sorry for the inconvenience.

Best Regards,

rhh4 · July 2, 2018, 5:51am

Hi,

We are facing a similar issue when converting HTML containing Chinese or Japanese text to PDF. On CentOS, the resulting PDF file contains box characters.

Could you share the status of issue PDFJAVA-36805?

Best Regards,
Rakshitha

asad.ali · July 2, 2018, 12:12pm

@rhh4

Thanks for your inquiry.

The earlier logged issue has been resolved. Since it was moved to priority support, the resolution update was shared in the respective forum. We found that the issue can be fixed by installing the Arial Unicode MS font, which contains all the required characters.

Please install the font in the default font directory, or set a path to the font using the Document.addLocalFontPath() method. Please also make sure to use the latest version of Aspose.PDF for Java, i.e. 18.6. If you need any further assistance, please feel free to let us know.
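Before registering a custom font folder, it may be worth confirming that the folder actually contains a font covering the text. The following is a rough sketch using only the JDK; FontFolderCheck and usableFonts are hypothetical names, and the check assumes the font ships as a .ttf or .otf file that java.awt.Font can load, which says nothing definitive about how the PDF library itself reads the folder.

```java
import java.awt.Font;
import java.awt.FontFormatException;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical pre-flight check (JDK only, not an Aspose.PDF API).
class FontFolderCheck {

    // Returns the .ttf/.otf files in the folder that can display every character of the sample.
    static List<File> usableFonts(File folder, String sample) {
        List<File> usable = new ArrayList<File>();
        File[] files = folder.listFiles();
        if (files == null) {
            return usable; // folder does not exist, is not a directory, or is unreadable
        }
        for (File file : files) {
            String name = file.getName().toLowerCase(Locale.ROOT);
            if (!name.endsWith(".ttf") && !name.endsWith(".otf")) {
                continue;
            }
            try {
                Font font = Font.createFont(Font.TRUETYPE_FONT, file);
                if (font.canDisplayUpTo(sample) == -1) { // -1 means fully displayable
                    usable.add(file);
                }
            } catch (FontFormatException | IOException e) {
                // Not a loadable font file; skip it and keep scanning.
            }
        }
        return usable;
    }
}
```

If usableFonts(folder, sampleText) returns at least one file, the folder path is a reasonable candidate to pass to Document.addLocalFontPath(folder.getPath()) before the HTML document is loaded.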

rhh4 · July 3, 2018, 11:48am

Hi Ali,

This worked. Thanks for your inputs.

Best Regards,
Rakshitha

asad.ali · July 3, 2018, 6:27pm

@rhh4

Thanks for your kind feedback.

It is good to know that your issue has been resolved by the suggested approach. Please keep using our API, and if you need any further assistance or have another inquiry, please feel free to create a new topic.