Japanese Text with Shift-JIS encoding does not Render in output PDF using Java

dev.raz · July 30, 2021, 9:53am

Hi,

When I try to convert a Shift_JIS-encoded text file to PDF, the Japanese characters come out garbled. Setting the encoding gives the same result. The file was created with Sakura Editor.

com.aspose.words.LoadOptions options = new com.aspose.words.LoadOptions();
options.setLoadFormat(com.aspose.words.LoadFormat.TEXT);
options.setEncoding(Charset.forName("Shift_JIS"));
Document document = new Document(decompressInput);
WordsFontWarning warning = new WordsFontWarning();
document.setWarningCallback(warning);
document.setFontSettings(fontSettings);
document.save(outputStream, SaveFormat.PDF);

sjis-ecncode.7z (153 Bytes)

What should I do to get a correct PDF?

Regards,
Raz

dev.raz · July 30, 2021, 2:57pm

I am also getting the same problem with an ANSI-encoded TXT file.
I set the following encoding for ANSI:

options.setEncoding(Charset.forName("Cp1252"));

I also tried without setting the encoding in both cases and got the same result.

sample-ansi.7z (155 Bytes)

tahir.manzoor · July 30, 2021, 4:55pm

@dev.raz

Your code creates the load options but never passes them to the Document constructor. Please load the document with the load options, as shown below, to avoid this issue.

Document document = new Document(MyDir +"sjis-ecncode.txt", options);
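To see why the charset has to reach the loader, here is a minimal sketch that uses only the JDK, not Aspose.Words. It encodes a Japanese word as Shift_JIS and decodes the same bytes twice: once with the matching charset and once with windows-1252. The class name `SjisRoundTrip` and its `decode` helper are illustrative and do not come from the original code.

```java
import java.nio.charset.Charset;

class SjisRoundTrip {
    // Decodes raw bytes with the given charset; a wrong charset silently yields garbled text.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The word "nihongo" in kanji, written as Unicode escapes so the
        // encoding of this source file does not matter.
        String original = "\u65E5\u672C\u8A9E";
        Charset sjis = Charset.forName("Shift_JIS");
        byte[] raw = original.getBytes(sjis);

        // Matching charset: the text survives the round trip.
        System.out.println(decode(raw, sjis).equals(original));
        // Mismatched charset: the same bytes turn into different characters.
        System.out.println(decode(raw, Charset.forName("windows-1252")).equals(original));
    }
}
```

The bytes on disk are identical in both calls; only the charset applied while reading them differs, which is the information the load options carry.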

dev.raz · August 2, 2021, 6:08am

Hi @tahir.manzoor,

Thanks, I missed passing the load options to the document.

dev.raz · August 2, 2021, 7:00am

Hi, @tahir.manzoor,

My text/CSV files can be in any encoding. Can Aspose convert those files to PDF without the encoding being specified in code?

Regards,
Raz

tahir.manzoor · August 2, 2021, 4:09pm

@dev.raz

You do not need to specify the encoding in load options. By default, Aspose.Words tries to detect the proper encoding while loading a text file into its DOM.
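One case where detection does not have to guess is a file that starts with a byte order mark (BOM). The sketch below, using only the JDK, checks for the UTF-8 and UTF-16 marks. `BomSniffer` and `fromBom` are illustrative names, and this is not a description of how Aspose.Words detects encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class BomSniffer {
    // Returns the charset announced by a leading byte order mark, if there is one.
    // Simplification: UTF-32 marks are not handled here.
    static Optional<Charset> fromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // No BOM found: the encoding has to be supplied or guessed some other way.
        return Optional.empty();
    }
}
```

Shift_JIS and ANSI files carry no BOM, so they fall through to the empty result, and any detector has to rely on heuristics for them.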

dev.raz · August 3, 2021, 4:17am

Hi @tahir.manzoor,

I tried without specifying the encoding in load options, but for Shift-JIS and ANSI encoded text files the Japanese characters are still not converted correctly.

I have used the following code.

com.aspose.words.LoadOptions options = new com.aspose.words.LoadOptions();
options.setLoadFormat(com.aspose.words.LoadFormat.TEXT);
Document document = new Document(GZIPInputStream, options);
document.setFontSettings(fontSettings);
document.save(outputStream, SaveFormat.PDF);

I have attached the files I used and the result.

Encoding-issue.7z (38.4 KB)

Regards,
Raz

tahir.manzoor · August 3, 2021, 4:06pm

@dev.raz

You are testing with a very simple document. Please test with your original TXT document (with more text) in Shift-JIS or ANSI encoding. If the problem persists, please ZIP and attach the original text document here for testing. We will investigate the issue and provide more information.

dev.raz · August 4, 2021, 6:31am

Hi @tahir.manzoor,

Thanks for the suggestion. We will check with files having more text.

But we would like to know whether the Aspose library has any limitation in auto-detecting the encoding of a text/CSV file.
The file we shared previously came from a customer. Since a customer's file can contain any number of characters (even a single character), we would like to know the library's limitations in this case.

[ Update : ]

We tried text files with more text, and the Aspose library auto-detects the encoding. But with CSV files, the library does not auto-detect the encoding, and the output contains garbled characters.

I have attached the CSV files we used for testing.

csv-encoding-samples.7z (512 Bytes)

Regards,
Raz

tahir.manzoor · August 4, 2021, 4:03pm

@dev.raz

We tested the scenario with the latest version, Aspose.Words for Java 21.7, and could not reproduce the issue. Please upgrade to Aspose.Words for Java 21.7. The output PDF files are attached to this post for your reference.

ansi-encoding.csv.java.pdf (19.6 KB)
sample-shift-jis.csv.java.pdf (22.4 KB)
unicode-encoding.csv.java.pdf (22.9 KB)
utf-8-encode.csv.java.pdf (22.8 KB)

tahir.manzoor · August 5, 2021, 7:20am

A post was split to a new topic: Text encoding is wrong after importing CSV

dev.raz · August 5, 2021, 7:32am

@tahir.manzoor

I will follow the new topic for the CSV-related issue.

Could you answer the query below? We use Aspose.Words to convert TXT to PDF.

When we use a TXT file with very few characters, the encoding is not detected properly and we get garbled characters in the PDF output. With a TXT file that has more characters, the encoding is detected correctly. So we would like to know whether the Aspose library has any limitation in auto-detecting the encoding of a text/CSV file.

Regards,
Raz

tahir.manzoor · August 5, 2021, 3:40pm

@dev.raz

Aspose.Words detects the encoding of TXT files for both small and large documents. However, if a TXT document contains very little text, or contains text in different encodings, Aspose.Words selects the encoding it judges most suitable for the text.

Please remove the English text from the TXT files and convert them to PDF. You will get the correct output.

dev.raz · August 6, 2021, 4:17pm

Hi @tahir.manzoor

I understand this issue is not related to file size. Thanks.

So if a text document contains both English and Japanese characters and is saved with Shift-JIS encoding, Aspose cannot recognize Shift-JIS during auto-detection. Should I consider this a limitation of Aspose?

Regards,
Raz

tahir.manzoor · August 6, 2021, 5:42pm

@dev.raz

Please note that Aspose.Words reads the whole text file, counts the characters that match each encoding, and chooses the encoding that matches the greatest number of characters.

In your case, when the Shift-JIS characters outnumber the other (English) characters, Aspose.Words chooses Shift-JIS. You can verify this by adding more Japanese text to the TXT file.
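If the set of encodings a customer file can use is known in advance, one possible workaround is to choose the charset yourself and pass it through `options.setEncoding(...)` as in the first post, instead of relying on auto-detection. The sketch below uses only the JDK and is an assumption-laden illustration, not the algorithm Aspose.Words uses: it tries each candidate with a strict decoder and returns the first one that reports no errors. `StrictCharsetPicker`, `decodesCleanly`, and `pick` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

class StrictCharsetPicker {
    // True if the whole byte array decodes in the given charset without
    // malformed or unmappable sequences.
    static boolean decodesCleanly(byte[] raw, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the first candidate that decodes cleanly; list order sets the priority.
    static Optional<Charset> pick(byte[] raw, List<Charset> candidates) {
        return candidates.stream()
                         .filter(c -> decodesCleanly(raw, c))
                         .findFirst();
    }
}
```

A call might look like `StrictCharsetPicker.pick(bytes, Arrays.asList(StandardCharsets.UTF_8, Charset.forName("Shift_JIS")))`. Putting UTF-8 first is a design choice: its byte patterns are strict, so Shift_JIS text rarely passes as UTF-8 by accident. Single-byte code pages accept most byte values, so if one is included it probably belongs last. A clean decode only shows that the bytes are valid in that charset, not that the charset is the one the author used, so very short files can still be ambiguous.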