Write Unicode Text of any Language (Punjabi, Gujarati) & UTF-8 Characters in DOCX & Convert to PDF using Java API

Hi Team,

We need to support multiple languages in DOCX output. As a sample, we write text in different languages to a document, as shown below, and save the document in DOCX format.

When we open the document, the fonts do not resolve and the text displays as squares.

We tried several options to resolve this, such as setting TrueType fonts through FontSettings and embedding system fonts, but nothing helped.

Please suggest the approach we should use to build a Word document in DOCX format with text in different languages.

The complete code is attached here: SampleCode.zip (17.0 KB)

Note: We are using aspose-words-18.8-jdk16.

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.write("Punjabi (India): ੂਪਾ ੌਹਗਮਕ ਵੀਦੈਲ ਿਦੰ ਰਹਸਜਾ੍ ਦਨਾੀ ੂਪਾ ਤੋਬ ੍ਦੁ");
builder.writeln();
builder.write("Gujarati (India): ૂપા ૌહગમક વીદૈલ િદં રહસજા્ દનાી ૂપા તોબ ્દુ");
doc.save("C:/APA/docs/Aspose/Unicode/Test.docx");

Thanks,
Srinivas

I have also been facing this issue for a couple of days. Please propose a solution.

Thanks,
Veera

@srinivasc,

You need to specify a suitable font name before writing text in each language. For example:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);

builder.write("Punjabi (India): ");

builder.getFont().setName("Verdana"); // You need to specify Punjabi Font name here
builder.write("ੂਪਾ ੌਹਗਮਕ ਵੀਦੈਲ ਿਦੰ ਰਹਸਜਾ੍ ਦਨਾੀ ੂਪਾ ਤੋਬ ੍ਦੁ");

builder.getFont().clearFormatting();
builder.writeln();
builder.write("Gujarati (India):");

builder.getFont().setName("Arial"); // You need to specify suitable Gujarati Font name here
builder.write(" ૂપા ૌહગમક વીદૈલ િદં રહસજા્ દનાી ૂપા તોબ ્દુ");

builder.getFont().clearFormatting();
builder.writeln();

doc.save("D:\\temp\\awjava-18.9.docx");

Hope this helps.

Hi @awais.hafeez,

Thanks for your quick reply.

I have already seen this solution, but it will not work in my case. I fetch a batch of data from the database that contains text in different languages, and unfortunately it carries no font information.

So I would have to identify the font needed at each position in the data and set the fonts accordingly, and I do not see any way to determine the font in Java.

Example Data:

the quick brown fox jumped over the lazy dog ૂપા ૌહગમક વીદૈલ િદં રહસજા્ દનાી ૂપા તોબ ્દુ 快速的棕色狐狸跳過懶惰的狗 тхе љуицк броњн фоџ јумпед овер тхе лаѕз дог فاث ضعهؤن لاقخصى بخء تعةحثي خرثق فاث مشئغ يخل տհէ խըիգկ բրուն ֆոց ճըմպէդ ովէր տհէ լազե դոք ੂਪਾ ੌਹਗਮਕ ਵੀਦੈਲ ਿਦੰ ਰਹਸਜਾ੍ ਦਨਾੀ ੂਪਾ ਤੋਬ ੍ਦੁ тхе љуицк броњн фоџ јумпед овер тхе лаѕз дог otğ frnvm çıhgz ahö krspğe hcğı otğ lujd ehü:)Ended here last…😂😂😂😂😂😂 絵文字"

Thanks,
Srinivas

@srinivasc,

I think you can use Google’s translation APIs to detect the language of a given string and, based on the returned language, specify the correct font name in Aspose.Words:
https://cloud.google.com/translate/docs/detecting-language
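
If calling an external language-detection service is not an option, one local alternative is to split the text by Unicode script rather than by language, using the JDK's Character.UnicodeScript. The following is only a minimal sketch: the class ScriptRuns, its Run type and the script-to-font table are hypothetical names invented for illustration, and the font names in the table are assumptions that must match fonts actually available on the machine. Script is not the same as language (Serbian and Russian both map to Cyrillic, for example), but a font is usually chosen per script anyway.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class ScriptRuns {

    /** A stretch of text whose letters all belong to one Unicode script. */
    public static final class Run {
        public final UnicodeScript script;
        public final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    // Hypothetical script-to-font table; every font name here is an assumption
    // and must match a font that is installed or present in the fonts folder.
    private static final Map<UnicodeScript, String> FONTS =
            new EnumMap<UnicodeScript, String>(UnicodeScript.class);
    static {
        FONTS.put(UnicodeScript.GURMUKHI, "Raavi");   // script used for Punjabi
        FONTS.put(UnicodeScript.GUJARATI, "Shruti");
        FONTS.put(UnicodeScript.HAN, "Microsoft JhengHei");
        FONTS.put(UnicodeScript.ARMENIAN, "Sylfaen");
    }

    /** Returns the assumed font for a script, or a broad fallback font. */
    public static String fontFor(UnicodeScript script) {
        String font = FONTS.get(script);
        return font != null ? font : "Arial Unicode MS";
    }

    /** Splits text into runs; a new run starts whenever the script changes. */
    public static List<Run> split(String text) {
        List<Run> runs = new ArrayList<Run>();
        StringBuilder buf = new StringBuilder();
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            // Spaces, digits, punctuation and combining marks carry no script
            // of their own, so they stay attached to the run in progress.
            boolean neutral = s == UnicodeScript.COMMON
                    || s == UnicodeScript.INHERITED
                    || s == UnicodeScript.UNKNOWN;
            if (!neutral && s != current) {
                if (current != null) {
                    runs.add(new Run(current, buf.toString()));
                    buf.setLength(0);
                }
                current = s;
            }
            buf.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (buf.length() > 0) {
            runs.add(new Run(current == null ? UnicodeScript.COMMON : current, buf.toString()));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin text followed by one Gurmukhi and one Gujarati syllable.
        for (Run run : split("fox \u0A2A\u0A3E \u0AAA\u0ABE")) {
            System.out.println(run.script + " -> " + fontFor(run.script) + " : " + run.text);
        }
    }
}
```

Each run could then be written with the calls already shown in this thread, for example builder.getFont().setName(ScriptRuns.fontFor(run.script)) followed by builder.write(run.text). Characters classified as COMMON, which includes most emoji, stay attached to the run in progress, so emoji would still need separate handling.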

@awais.hafeez,

Thanks for your suggestion.

Is there any way to make this work without specifying the font every time the language changes?
In PDF this works fine with the following setup.

// Start from the font sources Aspose.Words already uses.
ArrayList<FontSourceBase> fontSources = new ArrayList<FontSourceBase>(Arrays.asList(FontSettings.getDefaultInstance().getFontsSources()));
// Add the folder where all the fonts are available; true scans subfolders as well.
FolderFontSource folderFontSource = new FolderFontSource("C:/APA/docs/Aspose/Fonts", true);
fontSources.add(folderFontSource);
// Convert the list of sources back into an array of FontSourceBase objects.
FontSourceBase[] updatedFontSources = fontSources.toArray(new FontSourceBase[fontSources.size()]);
// Apply the new set of font sources to use.
FontSettings.getDefaultInstance().setFontsSources(updatedFontSources);
doc.save("C:/APA/docs/Aspose/Unicode/Test.pdf", SaveFormat.PDF);

Thanks,
Srinivas

@srinivasc,

Please also provide the Aspose.Words-generated DOCX file containing the square boxes and the corresponding PDF file showing the desired output here for further testing. We will investigate the scenario further on our end and provide you with more information.

@awais.hafeez,

I have attached the sample code I used to generate the Word and PDF documents, along with the documents generated by this code: unicode.zip (125.7 KB)

I could not attach the Fonts folder referenced in the code because it is around 500 MB. You can use the fonts from your OS (C:\Windows\Fonts). If a font for the specific language is available, the text resolves in the PDF.

Please let me know if any other details are required.

Thanks,
Srinivas

@srinivasc,

I am afraid there is no simple way to detect the language of a string (this is actually outside the scope of Aspose.Words). You can try using the “Arial Unicode” font, which contains glyphs for a very wide range of languages. We have installed the “Arial Unicode” font on our end, and we do not see any square boxes in your shared “unicode.docx” document when it is opened with MS Word 2016. Please check this screenshot.

@awais.hafeez,

Thanks for all your support. I can now render text in all the languages with “Arial Unicode” in the PDF document, except for emojis.
For the emojis I tried the “Segoe UI Emoji” font with aspose-words-16.1.0-java, and this did not work.

Please let me know which font and which Aspose version I should use to support emojis.

Thanks,
Srinivas
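
Before choosing a font for emojis, it can help to know which parts of the data actually contain them. Emojis such as U+1F602 lie outside the Basic Multilingual Plane, so in a Java String they occupy two chars (a surrogate pair) and have to be scanned by code point. The sketch below is a rough heuristic under stated assumptions: the class EmojiCheck and its methods are hypothetical names, and the three Unicode blocks it tests cover many but not all emojis.

```java
import java.lang.Character.UnicodeBlock;

final class EmojiCheck {

    /** Rough heuristic: true for code points in a few well-known emoji blocks. */
    public static boolean isLikelyEmoji(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.EMOTICONS
                || block == UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS
                || block == UnicodeBlock.TRANSPORT_AND_MAP_SYMBOLS;
    }

    /** Scans by code point, so a surrogate pair is treated as one character. */
    public static boolean containsEmoji(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLikelyEmoji(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1F602 (face with tears of joy) written as a surrogate pair.
        System.out.println(containsEmoji("Ended here last \uD83D\uDE02"));
        System.out.println(containsEmoji("plain text"));
    }
}
```

Text flagged this way could be written with an emoji-capable font such as the “Segoe UI Emoji” font mentioned above, while the rest keeps its regular font.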

@srinivasc,

We suggest you upgrade to the latest version of Aspose.Words for Java, i.e. 18.9, and see how it goes on your end.

In case the problem still remains, please ZIP and upload your input Word document (the one you are getting this problem with) and the Aspose.Words-generated PDF file showing the undesired behavior here for testing. Please also provide a comparison screenshot highlighting the problematic emojis in the Aspose.Words-generated output against your expected output, and attach it here for our reference. We will then investigate the issue further on our end and provide you with more information.

It worked for me with Aspose version 18.7. Thanks for all your support.

@srinivasc,

It is great that you were able to resolve this issue on your end. Please let us know any time you have any further queries.

I have a problem with the fonts below in PDF.
I downloaded these fonts and reference them as TrueType fonts while writing to the PDF document, but the characters for all of them display as square boxes.
Please confirm which font I have to use to render these characters in the PDF document.

Segoe UI Symbol ----- ⏴⏵⏶⏷⏸⏹⏺
Malgun Gothic ----- ᇹᇺᇻᇼᇽᇾᇿ
Sylfaen ----- ⴀ ⴁ ⴂ ⴃ ⴄ ⴅ ⴆ ⴇ ⴈ ⴉ ⴊ
Microsoft JhengHei ----- ㇐ ㇑ ㇒ ㇓
Microsoft JhengHei ----- ㄭ

Thanks,
Srinivas
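
When a character shows as a square box, a first diagnostic step is to find out which Unicode block it belongs to, and then check in a font viewer whether the chosen font file really has glyphs for that block. The sketch below uses only the JDK's Character.UnicodeBlock; the class BlockReport and its methods are hypothetical names for illustration, and the code does not inspect any font file.

```java
import java.lang.Character.UnicodeBlock;

final class BlockReport {

    /** Returns the Unicode block of the first code point in the sample. */
    public static UnicodeBlock blockOf(String sample) {
        return UnicodeBlock.of(sample.codePointAt(0));
    }

    /** Formats the first code point as "U+XXXX" followed by its block. */
    public static String describe(String sample) {
        return String.format("U+%04X %s", sample.codePointAt(0), blockOf(sample));
    }

    public static void main(String[] args) {
        // One sample character from each problem line above, written as escapes.
        String[] samples = {"\u23F4", "\u11F9", "\u2D00", "\u31D0", "\u312D"};
        for (String sample : samples) {
            System.out.println(describe(sample));
        }
    }
}
```

For the five samples, the blocks are Miscellaneous Technical, Hangul Jamo, Georgian Supplement, CJK Strokes and Bopomofo respectively.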

@srinivasc,

Please ZIP and upload your input Word document (you are getting this problem with) and Aspose.Words generated PDF file showing the undesired behavior here for testing. We will investigate the issue on our end and provide you more information.

Please find attached the code that I used to generate the PDF document and a sample PDF document generated with it: buildPDF.zip (38.0 KB)

I have the following fonts in the fonts folder: ARIALUNI, msjh, msjhbd, msjhl, SEGOEUISL, seguiemj, seguisym and sylfaen.

Please let me know if any other details are required.

@srinivasc,

Please simply install the following font files:

  • Malgun Gothic
  • Segoe UI Symbol
  • Microsoft JhengHei
  • Sylfaen

Hope this helps.

I have already installed these fonts and I reference them while generating the PDF document, but it still does not work. The code and the generated PDF document are attached in my previous comment.

Please let me know if any other details are required.

@srinivasc,

Please try using the following code:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.getFont().setName("Malgun Gothic");
builder.write("Malgun Gothic: ᇹᇺᇻᇼᇽᇾᇿ");
builder.writeln();
builder.getFont().setName("Segoe UI Symbol");
builder.write("Segoe UI Symbol : ⏴⏵⏶⏷");
builder.writeln();
builder.getFont().setName("Microsoft JhengHei");
builder.write("Microsoft JhengHei : ㇐ ㇑ ㇒ ㇓");
builder.writeln();
builder.write("Microsoft JhengHei : ㄭ");
builder.writeln();
builder.getFont().setName("Sylfaen");
builder.write("Sylfaen  : ⴀ ⴁ ⴂ ⴃ ⴄ ⴅ ⴆ ⴇ ⴈ ⴉ ⴊ");
doc.save("D:\\temp\\awjava-18.10.docx");
doc.save("D:\\temp\\awjava-18.10.pdf");

Hi @awais.hafeez,

Thanks for providing the example code. With it, these fonts render in the PDF.

Unfortunately, this does not meet our needs. We have no way of knowing which language will appear where in our data, so we have to make this work using only the TrueType fonts in the fonts folder.
FontSettings.getDefaultInstance().setFontsFolder("C:/tmp/Fonts", true);

We have the “Arial Unicode MS” TrueType font available in the Fonts folder specified above, and it supports most of the scripts we need, such as Chinese, Gujarati, Punjabi and Hindi.

However, we have issues with Malgun Gothic, Segoe UI Symbol, Microsoft JhengHei and Sylfaen. We downloaded the corresponding font files and placed them in the Fonts folder, but it is not working.

Please confirm whether this is achievable using TrueType fonts.
We already have an Aspose licence.