Bad support of unicode characters in text file

cap.aspose · September 18, 2017, 4:19pm

Hi,

We are using Aspose.Words 17.7 for .NET to convert various input file types to text-based PDF or image files. We have run into an issue with Unicode symbols in text files. We convert an input TXT file to PDF with the following code:

Document wordDocument = new Document(contentStream);
wordDocument.Save(outputStream, SaveFormat.Pdf);

If we run this on a computer with Windows 10 and without Microsoft Office installed, we get an output PDF in which the Unicode symbols are replaced by empty boxes. This may be because the corresponding fonts are not installed on the computer. So we installed Microsoft Office 2013 and repeated the experiment. Now we get a PDF file in which Georgian is displayed correctly, but Ethiopic characters and Runes are still replaced by empty boxes. However, if we open the input document in Microsoft Word in the same environment, all Unicode symbols are displayed correctly, and MS Word produces a correct PDF.

So it seems that Aspose.Words cannot render all Unicode characters correctly when saving to PDF.

In the attached ZIP:

UTF8.txt - input TXT file;
UTF8ByAspose.pdf - PDF file generated by Aspose.Words without Microsoft Office installed;
UTF8ByAsposeWithOffice.pdf - PDF file generated by Aspose.Words after Microsoft Office was installed;
UTF8ByWord.pdf - PDF file saved by Microsoft Word.

Thanks,
Roman

Unicode.zip (417.6 KB)

tilal.ahmad · September 19, 2017, 3:32am

@cap.aspose

Thanks for your inquiry. We have tested the scenario and noticed the reported issue. We have logged a ticket WORDSNET-15894 in our issue tracking system for further investigation. We will keep you updated about the issue resolution progress within this forum thread.

We are sorry for your inconvenience.

cap.aspose · September 29, 2017, 6:50pm

I did some additional research on this issue, and what I found may help determine the cause.

If I convert the same source file to TIFF instead of PDF, in the same environment, all the symbols are displayed correctly.

I convert to TIFF using the following code:

Document wordDocument = new Document(contentStream);
var imageSaveOptions = new Aspose.Words.Saving.ImageSaveOptions(SaveFormat.Tiff);
imageSaveOptions.PixelFormat = ImagePixelFormat.Format24BppRgb;
imageSaveOptions.Resolution = resolution; // resolution is defined elsewhere
imageSaveOptions.PageCount = 1; // render only one page

stream = new MemoryStream();
wordDocument.Save(stream, imageSaveOptions);

No special options are needed here; Aspose.Words finds the fonts and displays all the symbols correctly.

I convert to PDF this way:

var wordDocument = new Document(inputStream);
Aspose.Words.Saving.PdfSaveOptions options = new PdfSaveOptions();
wordDocument.Save(outputStream, options);

I thought this might be because the fonts are not embedded in the output PDF file, so I tried setting different values for the following options:

options.EmbedFullFonts (true / false);
options.FontEmbeddingMode (EmbedAll);
options.Compliance (PdfA1a / PdfA1b);
options.UseCoreFonts (true / false).

I tried each option separately and in combination. Unfortunately this did not help; the characters are still not displayed.

That is, the problem occurs only when converting to PDF. When converting to TIFF there is no problem.

The same issue occurs with HTM files.

In the attached ZIP:

UTF8ByAspose.txt.tif - TIFF file, generated by Aspose.Words from UTF8.txt (UTF8.txt in attached ZIP in my first message);
Hello - Unicode.txt - initial TXT file;
Hello - Unicode.txt.pdf - PDF file, generated by Aspose.Words from Unicode.txt;
Hello - Unicode.txt.tif - TIFF file, generated by Aspose.Words from Unicode.txt;
unilang.htm - initial HTM file;
unilang.htm.pdf - PDF file, generated by Aspose.Words from unilang.htm;
unilang.htm.tif - TIFF file, generated by Aspose.Words from unilang.htm;

Thanks,
Roman

Files.zip (672 KB)

tilal.ahmad · September 30, 2017, 4:21am

@cap.aspose

Thanks for sharing the additional information. Our product team has completed the investigation of this issue and is now working on the fix. We have also passed your findings on to the product team, and they will look into them as well.

awais.hafeez · November 17, 2017, 8:43am

@cap.aspose,

The issues you have found earlier (filed as WORDSNET-15616) have been fixed in this Aspose.Words for .NET 17.11 update and this Aspose.Words for Java 17.11 update.


cap.aspose · February 16, 2018, 5:06pm

Hi,

I downloaded Aspose.Words for .NET version 18.1 and repeated the conversion in the same environment as before. Some characters are now displayed, but others are still shown as empty boxes.

The attached ZIP contains two PDF files that I generated with Aspose.Words 18.1.

The issue still reproduces.

Thanks,
Roman

Files18.zip (110.5 KB)

awais.hafeez · February 17, 2018, 4:19am

@cap.aspose,

You have attached only the output PDF files generated by Aspose.Words for .NET 18.1. Please try the latest version of Aspose.Words for .NET, i.e. 18.2, on your end and see how it goes. If the problem remains, please ZIP and upload your input documents here for testing. We will investigate the issue on our end and provide you with more information.

cap.aspose · February 20, 2018, 2:19pm

Hi,

I downloaded Aspose.Words for .NET 18.2 and tried to convert the files using this version. For text files there is no difference from version 18.1: the issue still reproduces. Only the issue with converting HTML files has been fixed.

In the attached ZIP:

“Hello - Unicode.txt”, “UTF8.txt” - initial TXT files;
“Hello - Unicode.txt.1802.pdf”, “UTF8ByAspose.txt.1802.pdf” - output PDF files generated by Aspose.Words.

Thanks,
Roman

Unicode1802.zip (112.1 KB)

awais.hafeez · February 21, 2018, 4:10am

@cap.aspose,

We tested the scenario and managed to reproduce the same problems on our end. We have logged the following problems in our issue tracking system so that they can be corrected.

WORDSNET-16495: related to UTF8.txt
WORDSNET-16496: related to Hello - Unicode.txt

Our product team will further look into the details of these problems and we will keep you updated on the status of corrections. We apologize for your inconvenience.

aspose.notifier · June 8, 2018, 10:02am

The issues you have found earlier (filed as WORDSNET-16495) have been fixed in this Aspose.Words for .NET 18.6 update and this Aspose.Words for Java 18.6 update.

cap.aspose · January 28, 2019, 10:51am

Hi,

I downloaded the latest version of Aspose.Words (19.1) and found that the issue still exists. Please see the files in the attached ZIP:
folder 18.1 contains files produced by Aspose.Words 18.1;
folder 19.1 contains files produced by Aspose.Words 19.1;
folder SourceFiles contains the initial files;
folder NotePad contains screenshots of the initial files displayed in Notepad on the same system.

Please note that in the TIFF files produced by all versions of Aspose.Words, all the characters are displayed correctly.

Thanks,
Roman

Files.zip (1.5 MB)

awais.hafeez · January 28, 2019, 3:28pm

@cap.aspose,

Please refer to the following article to find out which fonts are missing on your machine.
How to Receive Notification of Missing Fonts and Font Substitution during Rendering
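
As a rough illustration of what such a notification can look like, here is a minimal sketch that collects font substitution warnings while a document is saved to PDF. It relies on the Aspose.Words `IWarningCallback` interface and the `Document.WarningCallback` property; the class name `FontWarningCollector` and the file paths are hypothetical and only serve the example.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;

// Hypothetical helper: remembers every font substitution warning raised by Aspose.Words.
class FontWarningCollector : IWarningCallback
{
    public List<string> Messages { get; } = new List<string>();

    public void Warning(WarningInfo info)
    {
        // Only font substitution warnings are of interest here.
        if (info.WarningType == WarningType.FontSubstitution)
            Messages.Add(info.Description);
    }
}

class Program
{
    static void Main()
    {
        // Assumed paths; replace with the real input and output locations.
        var doc = new Document("UTF8.txt");

        var collector = new FontWarningCollector();
        doc.WarningCallback = collector;

        // Font substitution is reported while the layout is built during Save.
        doc.Save("UTF8.pdf");

        foreach (string message in collector.Messages)
            Console.WriteLine(message);
    }
}
```

Each printed message should name a font that was requested and the font used in its place, which gives a starting point for deciding which fonts to install.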

You need to install missing fonts (e.g. try installing “Arial Unicode” font to see if the situation improves) and then execute the following code:

var options = new Aspose.Words.LoadOptions();
options.LoadFormat = Aspose.Words.LoadFormat.Text;
options.Encoding = Encoding.Unicode;
Document doc = new Document("E:\\temp\\files\\Hello - Unicode.txt", options);
doc.Save("E:\\temp\\files\\Hello - Unicode-19.1.tiff");
doc.Save("E:\\temp\\files\\Hello - Unicode-19.1.pdf");

Here is the output produced on our end: Hello - Unicode.zip (1.5 MB)
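
Note that the snippet above hard-codes `Encoding.Unicode`, which in .NET means UTF-16 little-endian. A file that is actually UTF-8 encoded would presumably need `Encoding.UTF8` instead. The following is a minimal sketch of choosing the encoding from the first bytes of a file, assuming the input starts with a byte-order mark (this thread does not confirm that it does). `EncodingSniffer` and `Detect` are hypothetical names, and falling back to UTF-8 when no BOM is found is also an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper: maps a leading byte-order mark to a .NET Encoding.
static class EncodingSniffer
{
    public static Encoding Detect(byte[] head)
    {
        // UTF-32 LE must be checked before UTF-16 LE because both start with FF FE.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return Encoding.UTF32;
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little-endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big-endian
        return Encoding.UTF8;                 // assumption: no BOM means UTF-8
    }
}

class Program
{
    static void Main()
    {
        // Sample bytes: a UTF-16 LE BOM followed by the letter 'H'.
        byte[] sample = { 0xFF, 0xFE, 0x48, 0x00 };
        Console.WriteLine(EncodingSniffer.Detect(sample).EncodingName);
    }
}
```

The detected value could then be assigned to `options.Encoding` before the `Document` is constructed, instead of a fixed encoding.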

aspose.notifier · February 6, 2019, 4:39am

The issues you have found earlier (filed as WORDSNET-16496) have been fixed in this update.