Best practice for creating a PDF with potential UTF-8 characters (Java)

Hello Support Forum,

we are dealing with internationalization issues that I think other enterprise applications might also struggle with - and wonder what the best practice from Aspose’s point of view is:

The requirement is to create a dynamic PDF that contains dynamic content (based on user input) where that content could also contain Asian characters (Chinese, Japanese in particular).

In contrast to converting e.g. a Word document into PDF, where an intelligent font substitution process exists, there does not seem to be anything similar for creating a PDF from scratch (see Chinese characters issw - #7 by tilal.ahmad).

So we followed the advice to use the Font#doesFontContainAllCharacters method, together with our own order in which we check candidate fonts.
As we have to support Windows and Linux backends we chose the Google NotoSans font with its Asian language supporting versions like NotoSansCJKTCRegular.otf etc.

Due to a bug in NotoSansCJKTCRegular, the order of checking the fonts is the following:

  • “NotoSansRegular”
  • // as NotoSansCJKTCRegular renders dots and dashes incorrectly, we use the more specific fonts first
  • “NotoSansCJKjp-Regular”
  • “NotoSansSC-Regular”
  • “NotoSansTC-Regular”
  • “NotoSansCJKTCRegular”
  • “NotoSans-Italic”
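For readers who want to see the idea, the ordered check described above can be sketched roughly as follows. This is only an illustration, not our production code: the class FontFallback, the method firstCoveringFont and the toy coverage rule coversBelow are made-up names, and the coverage check is abstracted as a Predicate<String>. In a real application that predicate would be backed by Font#doesFontContainAllCharacters on the opened font.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

public class FontFallback {

    /**
     * Returns the name of the first candidate whose coverage check accepts the
     * whole text, following the map's iteration order (insertion order for a
     * LinkedHashMap). Returns empty if no candidate covers the text.
     */
    public static Optional<String> firstCoveringFont(
            Map<String, Predicate<String>> candidates, String text) {
        for (Map.Entry<String, Predicate<String>> candidate : candidates.entrySet()) {
            if (candidate.getValue().test(text)) {
                return Optional.of(candidate.getKey());
            }
        }
        return Optional.empty();
    }

    /**
     * Toy coverage rule for the demo only (an assumption, not real font data):
     * accepts text whose code points are all below the given limit.
     */
    public static Predicate<String> coversBelow(int limit) {
        return text -> text.codePoints().allMatch(cp -> cp < limit);
    }

    public static void main(String[] args) {
        // Order matters: the more specific fonts are registered first.
        Map<String, Predicate<String>> candidates = new LinkedHashMap<>();
        candidates.put("NotoSansRegular", coversBelow(0x0250));        // toy: Latin only
        candidates.put("NotoSansCJKjp-Regular", coversBelow(0x10000)); // toy: whole BMP

        System.out.println(firstCoveringFont(candidates, "Hello"));    // Optional[NotoSansRegular]
        System.out.println(firstCoveringFont(candidates, "发明名称"));  // Optional[NotoSansCJKjp-Regular]
    }
}
```

If no candidate covers the whole text, the result is empty. That is the case where a single font per text fragment is not enough and the PDF ends up with missing glyphs.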

Unfortunately, when the content contains Asian characters, the resulting PDF most of the time does not display all characters correctly.
Also, this approach seems awkward: I would have expected Aspose to handle this burden, rather than us having to fix these problems in our application code.

But maybe creating PDF files from scratch via Aspose.PDF is not the best way here?
Are other fonts better than NotoSans?
Would it be better to create a Word document and then convert it to PDF?
Could you give some advice please?

Thanks a lot in advance,
Stefan Raubal

@stefan.raubal

When no font is specified in code, Aspose.PDF looks for suitable fonts while generating a PDF document. The basic fonts that need to be installed on the system are the MS TrueType fonts, and Arial Unicode MS has the broadest support for non-English characters. If you face issues rendering non-English characters even though supporting fonts are present, please share a sample code snippet with us so that we can investigate the case further and share our feedback with you.

Other fonts, such as SimSun, also work fine with Aspose.PDF for rendering Chinese/Japanese characters. You can try installing it and specifying it when dealing with these languages.

Aspose.PDF also offers font substitution, for example (shown here with the .NET API):

// Subscribe once, outside the handler (e.g. right after creating the document):
document.FontSubstitution += OnFontSubstitution;

// Called each time a font is replaced by another one.
private static void OnFontSubstitution(Pdf.Text.Font oldFont, Pdf.Text.Font newFont)
{
 Console.WriteLine($"=> Font '{oldFont.FontName}' was substituted with another font '{newFont.FontName}'");
}

You can also try Word to PDF conversion using Aspose.Words if that suits your needs and gives you better results. The Aspose.PDF and Aspose.Words APIs are designed to deal with different file formats. Compared to DOC/DOCX/Word files, PDF is a completely different format with a complex document structure. We resolve issues related to fonts and character rendering as they are reported, in order to keep improving the API. Please feel free to share details of the cases, along with a code snippet and sample files, so that we can address the issue accordingly.

Hi Ali,

thanks for your reply!

The main reason that stops us from using fonts you mentioned is legal issues - we are selling a Java based cross platform product that runs on Linux and Windows servers. While Windows servers might have installed some of these fonts, the Linux distributions have a harder time legally using these fonts.

As we already implemented quite some functionality using PDF creation from scratch via Aspose.PDF, switching to an Aspose.Word based solution would require real effort.

Can you confirm that if you don’t specify a font name during PDF creation with Aspose.PDF and you specify e.g. the NotoSansCJKTCRegular font’s folder as local font path (it is also an installed system font on that machine), that Aspose will NOT use it for Asian characters? (Because that’s what I experience.)

Thanks a lot for your insights,
Stefan

@stefan.raubal

That should not be the case. If a font is installed in the system and specified as well while generating PDF from scratch, the API should use it. If you are experiencing the opposite, please share sample code snippet and font file with us so that we can also try to replicate the issue at our end and address it accordingly.

Hi Ali,

thanks for your response. I really appreciate the constructive discussion here! :slight_smile:

Here’s some example code. Regardless of whether I use the first lines that set up FontRepository or not (the ideal scenario would be to stay independent of whatever system fonts happen to be installed on the server, since we want to enforce the usage of NotoSans), the result is always based on MSGothic (see attachment), which cannot display all characters properly!

AsianCharactersTest-screenshot.png (23.6 KB)

    FontRepository.getSources().clear();
    // Neither of the two versions to define the font for Aspose made a difference:
    // FontRepository.getSources().add(new FileFontSource("C:\\path_to_font\\NotoSansCJKTCRegular.otf"));
    FontRepository.addLocalFontPath("C:\\path_to_font");

    Document document = new DocumentFactory().createDocument();

    Page page = document.getPages().add();
    page.setPageSize(597.6, 842.4);

    FloatingBox titleFloatingBox = new FloatingBox();
    titleFloatingBox.setTop(85);
    TextFragment titleText = new TextFragment("发明名称 (通过相似专利搜索、内部专利检索或其他来源) 。");
    titleText.getTextState().setCharacterSpacing(5f);
    titleText.getTextState().setFontSize(20.5f);
    titleText.setHorizontalAlignment(HorizontalAlignment.Center);
    titleFloatingBox.getParagraphs().add(titleText);

    page.getParagraphs().add(titleFloatingBox);

    document.save(new FileOutputStream("AsianCharactersTest.pdf"));

What am I doing wrong?

Kind regards, Stefan

@stefan.raubal

We noticed that the API was not selecting the font from the font folder that was set at the start. We tried another way to set the font folder, and it did not help either:

java.util.List<String> fontpaths = new java.util.ArrayList<String>();
fontpaths.add(dataDir);
FontRepository.setLocalFontPaths(fontpaths);

Therefore, we have logged an investigation ticket as PDFJAVA-40721 in our issue tracking system. We will analyze this behavior of the API in detail and keep you posted on the status of the ticket resolution.

Furthermore, we were finally able to render the Asian characters with the target font by specifying the font at the TextFragment level, as in the code below:

Document document = new Document();

Page page = document.getPages().add();
page.setPageSize(597.6, 842.4);

FloatingBox titleFloatingBox = new FloatingBox();
titleFloatingBox.setTop(85);
TextFragment titleText = new TextFragment("发明名称 (通过相似专利搜索、内部专利检索或其他来源) 。");
titleText.getTextState().setCharacterSpacing(5f);
titleText.getTextState().setFontSize(20.5f);
titleText.getTextState().setFont(FontRepository.openFont(dataDir + "NotoSansCJKjp-Regular.otf"));
titleText.setHorizontalAlignment(HorizontalAlignment.Center);
titleFloatingBox.getParagraphs().add(titleText);

page.getParagraphs().add(titleFloatingBox);

document.save(dataDir + "AsianCharactersTest.pdf");

You can use this workaround at the moment in order to use NotoSans Fonts to render Chinese/Japanese characters in the PDF.

We are sorry for the inconvenience.

Thanks Ali,

Good to see that you could reproduce the issue and want to improve the product! Great!
(Although I know it will take ~ a year until the fix might be released.)

Your suggested solution is basically what we also use as a workaround.
As we don’t know whether the content is Chinese (Simplified or even Traditional) or Japanese, we use this in combination with “Font.doesFontContainAllCharacters” to find out which NotoSansCJK font is the required one.

Looking forward to Aspose doing these things internally!
(This thread is done for me until the solution is available.)

Kind regards,
Stefan

@stefan.raubal

The resolution time of an issue depends on its complexity and on the number of issues logged before it. We resolve every logged issue; however, under the free support model they are resolved on a first-come, first-served basis. We really regret that the suggested workaround cannot work in the scenarios you have on your side.

We have recorded your concerns and will surely consider them during investigation of the logged ticket and let you know as soon as we have definite updates regarding resolution of the ticket.

We humbly apologize for the inconvenience caused.

The issues you have found earlier (filed as PDFJAVA-40721) have been fixed in Aspose.PDF for Java 21.8.

Thanks for the notification and the quick fix!
Do you have any details on how exactly the behavior was improved?

Kind regards, Stefan

@stefan.raubal

We have implemented the com.aspose.pdf.TextDefaults class to define the text subsystem defaults used by the PDF generator. You can choose one of 4 strategies using com.aspose.pdf.TextDefaults#setDefaultFontStrategy.

The default strategy is com.aspose.pdf.TextDefaults.DefaultFontStrategy#SystemFont.

For example, you can use the following code to choose the default font:

TextDefaults.setDefaultFontStrategy(TextDefaults.DefaultFontStrategy.PredefinedFont);
Font font = FontRepository.openFont(dataDir + "Noto-unhinted/NotoSansCJKjp-Bold.otf");
TextDefaults.setPredefinedFont(font);

Or use the com.aspose.pdf.TextDefaults.DefaultFontStrategy#TheFirstSuitableFoundFont strategy to search for the first suitable font among all registered fonts found by Aspose.PDF.

TextDefaults.setDefaultFontStrategy(TextDefaults.DefaultFontStrategy.TheFirstSuitableFoundFont);

Also, if you want to use only specific fonts, you can select them and use the following strategy:

com.aspose.pdf.TextDefaults.DefaultFontStrategy#ListOfFonts

In this case, the Chinese font will be selected for Chinese characters and the Arabic font for Arabic characters.

        TextDefaults.setDefaultFontStrategy(TextDefaults.DefaultFontStrategy.ListOfFonts);
        Font font1 = FontRepository.openFont(dataDir + "Noto-unhinted/NotoSansCJKjp-Regular.otf");
        Font font3 = FontRepository.openFont(dataDir + "Noto-unhinted/NotoSansArabicUI-Regular.ttf");
        TextDefaults.getDefaultFonts().add(font1);
        TextDefaults.getDefaultFonts().add(font3);

        Document document = new Document();

        Page page = document.getPages().add();
        page.setPageSize(597.6, 842.4);

        FloatingBox titleFloatingBox = new FloatingBox();
        titleFloatingBox.setTop(85);
        TextFragment titleText = new TextFragment("发明名称 (通过相似专利搜索、内部专利检索或其他来源) 。");
        titleText.getTextState().setCharacterSpacing(5f);
        titleText.getTextState().setFontSize(20.5f);
        titleText.setHorizontalAlignment(HorizontalAlignment.Center);
        titleFloatingBox.getParagraphs().add(titleText);

        TextFragment titleText2 = new TextFragment("أهلا");
        titleText2.getTextState().setCharacterSpacing(5f);
        titleText2.getTextState().setFontSize(20.5f);
        titleText2.setHorizontalAlignment(HorizontalAlignment.Center);
        titleFloatingBox.getParagraphs().add(titleText2);

        page.getParagraphs().add(titleFloatingBox);

        document.save(dataDir + "AsianAdnArabicCharactersTest_TheFirstSuitableFoundFont_21.8_.pdf");

Note also that TextDefaults is a static configuration and affects all threads.
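Because the configuration is static, one way to keep it predictable is to apply it exactly once during application startup, before any worker threads begin generating documents. The following is a minimal sketch of such a guard using only the JDK. The class OnceGuard and the method runOnce are hypothetical names, and the Runnable stands in for the TextDefaults calls shown earlier.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class OnceGuard {
    private final AtomicBoolean done = new AtomicBoolean(false);

    /**
     * Runs the action only on the first call. Returns true if this call ran
     * the action and false if an earlier call had already claimed it.
     * Note: a concurrent caller may get false while the first call is still
     * running the action, so call this during startup, before worker threads
     * exist.
     */
    public boolean runOnce(Runnable action) {
        if (done.compareAndSet(false, true)) {
            action.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        OnceGuard fontDefaults = new OnceGuard();
        Runnable configure = () -> {
            // Placeholder: the TextDefaults.setDefaultFontStrategy(...) and
            // related calls would go here in a real application.
            System.out.println("font defaults configured");
        };
        System.out.println(fontDefaults.runOnce(configure)); // prints the message, then true
        System.out.println(fontDefaults.runOnce(configure)); // false, action is not run again
    }
}
```

Holding one guard instance in the application's startup code avoids different request threads switching the strategy underneath each other.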