Aspose-words (python) to convert RTF to PDF

Hello,

I am trying to convert a translated RTF document into a PDF document, and for some reason the special characters are not converting.

Here is the general code:

import aspose.words as aw

doc = aw.Document('my_file.rtf')
saveOptions = aw.saving.PdfSaveOptions()
saveOptions.embed_full_fonts = True
doc.save('my_file.pdf', saveOptions)

This code opens the RTF successfully and saves a PDF, but all of the Chinese (for example) characters are rendered in the PDF as white boxes.

Is there an extra step required when using languages with special characters, or is this a bug? We have a paid license. The online free tool does not create a faulty PDF like this.

Attaching RTF and PDF for reference.

rtfandpdf.zip (712.4 KB)

Thanks,
Sean

@stillot2 The problem occurs because the fonts required for rendering the document are not available in the environment where the document is converted to PDF. Aspose.Words needs the fonts to build an accurate document layout. If it cannot find the fonts used in the document, the fonts are substituted, which can lead to differences in layout and appearance.
To properly render the attached RTF document, the following fonts should be available:

  • Arial
  • Times New Roman
  • Calibri
  • SimSun

@alexey.noskov Thank you for that information. That makes sense why I can see the symbols on the RTF on my machine.

We are building these documents inside of AWS Lambda (Amazon Linux OS). I am working through some documentation on how to add fonts to test it.

Since you support Aspose, I am curious whether you have any knowledge or documentation about what I am trying to do?

Also – I would like to ask how did you know it was those four fonts that I needed?

  • Arial
  • Times New Roman
  • Calibri
  • SimSun

We will be offering all google-translation-api’s possible translations (137 languages). Is there a way for me to check the document for these languages or will I need to return here and zip them for the other languages that fail to convert to PDF?

Thanks,
Sean

@stillot2

There is an example for .NET version that demonstrates how to use fonts stored in S3:
https://docs.aspose.com/words/net/integration-in-aws-lambda/#how-to-use-fonts-stored-in-s3-storage-in-aws-lambda
Unfortunately, the Python version currently has no way to extend the StreamFontSource base class, but you can use MemoryFontSource.
The feature request for adding a feature to extend base classes is logged as WORDSNET-25598.
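As a rough illustration of the MemoryFontSource route, the sketch below reads font files into memory and registers each one as a separate source. The helper names `read_font_blobs` and `build_font_settings` are made up for this example, and reading from a local folder is only a stand-in: in Lambda the bytes would presumably come from S3 or another store. It assumes `aw.fonts.MemoryFontSource` accepts the raw font bytes, so please check it against your installed version.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def read_font_blobs(folder):
    """Return the raw bytes of every font file under `folder` (recursive)."""
    return [
        path.read_bytes()
        for path in sorted(Path(folder).rglob("*"))
        if path.is_file() and path.suffix.lower() in FONT_SUFFIXES
    ]


def build_font_settings(font_blobs):
    """Create FontSettings with one MemoryFontSource per font blob.

    Assumption: aspose.words is installed and MemoryFontSource takes bytes.
    """
    import aspose.words as aw  # imported lazily so read_font_blobs works without it

    settings = aw.fonts.FontSettings()
    # Keep the existing (system) sources and add the in-memory ones.
    sources = list(settings.get_fonts_sources())
    sources.extend(aw.fonts.MemoryFontSource(blob) for blob in font_blobs)
    settings.set_fonts_sources(sources)
    return settings
```

The returned settings object would then be assigned to `doc.font_settings` before saving.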

I used the .NET version with IWarningCallback to get notifications about font substitution. Unfortunately, callbacks are also not yet available in the Python version; a feature request is logged as WORDSNET-24685.

As a simple check, you can convert the document to PDF using MS Word and check which fonts are used in the output PDF.
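If MS Word is not at hand, a rougher check is to read the font table that the RTF file itself declares. The sketch below is plain Python, not Aspose.Words API. The function names are made up for illustration, and it assumes each font sits in its own `{...}` group inside `\fonttbl` with a plain ASCII name. It only reports the fonts the document declares, not the fonts a renderer ends up choosing after substitution or fallback, so treat the result as a starting point.

```python
import re


def extract_group(text, start):
    """Return the brace-balanced RTF group that opens at text[start]."""
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it introduces
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
        i += 1
    return text[start:]  # unbalanced input: return what is left


def _font_name(entry):
    """Strip nested groups and control words from one font table entry."""
    inner = entry[1:-1]
    previous = None
    while previous != inner:  # drop nested groups such as {\*\panose ...}
        previous = inner
        inner = re.sub(r"\{[^{}]*\}", "", inner)
    inner = re.sub(r"\\'[0-9a-fA-F]{2}", "", inner)   # hex-escaped bytes
    inner = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", inner)  # control words
    return inner.strip().rstrip(";").strip()


def declared_fonts(rtf):
    """List font names declared in the \\fonttbl group of an RTF string."""
    start = rtf.find("{\\fonttbl")
    if start == -1:
        return []
    table = extract_group(rtf, start)
    names = []
    i = len("{\\fonttbl")
    while i < len(table):
        if table[i] == "\\":
            i += 2
        elif table[i] == "{":
            entry = extract_group(table, i)
            i += len(entry)
            name = _font_name(entry)
            if name:
                names.append(name)
        else:
            i += 1
    return names
```

For a file on disk, something like `declared_fonts(open('my_file.rtf', encoding='latin-1').read())` should work, since RTF control words are ASCII.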

@alexey.noskov Thanks for all the info. Adding a few fonts to our AWS Lambda helped a lot.

For reference, we found that these are the fonts pre-loaded on Amazon Linux:

"DejaVu Sans Bold", 
"DejaVu Sans Bold Oblique",
"DejaVu Sans ExtraLight", 
"DejaVu Sans Oblique", 
"DejaVu Sans", 
"DejaVu Sans Condensed Bold",
"DejaVu Sans Condensed Bold Oblique", 
"DejaVu Sans Condensed Oblique", 
"DejaVu Sans Condensed"

We recently added these fonts to the project:

"Arial Bold", 
"Arial Unicode MS", 
"Arial", 
"Calibri", 
"SimSun", 
"NSimSun", 
"Times New Roman", 
"Times Roman", 
"Times Bold", 
"Times Italic",
"Times Bold Italic"

Implementation:

doc = aw.Document('file.rtf')
font_settings = aw.fonts.FontSettings()
# I thought this default might help render a large set
font_settings.substitution_settings.default_font_substitution.default_font_name = "Arial Unicode MS" 
font_sources = font_settings.get_fonts_sources()
# /var/task is Lambda's working dir where the project is loaded
folder_font_source = aw.fonts.FolderFontSource("/var/task/fonts", True)
updated_font_sources = list(font_sources)
updated_font_sources.append(folder_font_source)
font_settings.set_fonts_sources(updated_font_sources)
doc.font_settings = font_settings
        
saveOptions = aw.saving.PdfSaveOptions()
saveOptions.embed_full_fonts = True
doc.save('file.pdf', saveOptions)

The following languages are the only ones we are still missing. Could you provide a font list for them like you did before?

  • si: Sinhala; Sinhalese
  • my: Burmese
  • km: Central Khmer
  • dv: Divehi; Dhivehi; Maldivian
  • am: Amharic
  • ti: Tigrinya

The PDFs all come out like the first one, with white boxes for unsupported characters. The output PDFs all have embedded fonts, so they are too large to attach. Hopefully this is enough!

Thanks
rtfs_only.zip (53.9 KB)

@stillot2 The following fonts should be available for rendering the attached documents:

  • test-am-2023-06-28-01_37_50.857394.rtf
    • Arial
    • Calibri
    • Nyala
  • test-dv-2023-06-28-01_37_33.221005.rtf: MS Word uses the following fonts, but Aspose.Words renders the document improperly. We have logged the problem as WORDSNET-25601.
    • Arial
    • MVBoli
  • test-km-2023-06-28-01_37_15.817348.rtf
    • Arial
    • Calibri
    • DaunPenh
  • test-mni-Mtei-2023-06-28-01_36_58.115499.rtf: MS Word uses the following fonts, but Aspose.Words renders the document improperly. We have logged the problem as WORDSNET-25602.
    • Arial
    • NirmalaUI
  • test-my-2023-06-28-01_36_40.375718.rtf
    • Arial
    • Calibri
    • MyanmarText
  • test-si-2023-06-28-01_36_21.385186.rtf
    • Arial
    • Calibri
    • IskoolaPota
  • test-ti-2023-06-28-01_38_08.817759.rtf
    • Arial
    • Calibri
    • Nyala

Also, you can use free Noto fonts as fallback fonts. Please see our documentation for more information:
https://docs.aspose.com/words/python-net/manipulating-and-substitution-truetype-fonts/#font-fallback-settings-from-xml

Hey @alexey.noskov thank you.

I am slightly confused – I added free Noto fonts for each language I mentioned before, and then added

font_settings.substitution_settings.default_font_substitution.default_font_name = "Arial Unicode MS"
font_sources = font_settings.get_fonts_sources()
folder_font_source = aw.fonts.FolderFontSource("/var/task/fonts", True)
updated_font_sources = list(font_sources)
updated_font_sources.append(folder_font_source)
font_settings.set_fonts_sources(updated_font_sources)
font_settings.fallback_settings.load_noto_fallback_settings()  # this line

And the missing languages are properly converted to PDF… however, my other languages that were behaving well are now unable to render.

I have not tested them all, but it looks like at least Chinese and Korean are affected. Does loading the Noto fallbacks skip over the Arial Unicode MS and SimSun fonts that I loaded to fix languages previously?

@stillot2 By default, Aspose.Words uses fallback settings that mimic the Microsoft Word fallback and rely on Microsoft Office fonts. When the fallback settings based on Google Noto fonts are loaded instead, the Microsoft Office fonts are no longer used. You can create your own fallback settings and add the fonts available in your environment. To achieve this, use the FontFallbackSettings.save method to save the loaded fallback settings, update the XML file, and then load it using the FontFallbackSettings.load method.
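The save, edit, and load round trip might look like the sketch below. The helper name `add_fallback_rule` is hypothetical, and the element and attribute names (`FallbackTable`, `Rule`, `Ranges`, `FallbackFonts`) are assumed from the XML format shown in the font fallback documentation; please compare them with the file that `save` actually writes. The XML editing is plain Python, and the Aspose.Words calls appear only as comments. Whether a new rule should go first or last depends on how the rules are prioritized, which is worth verifying on a test document.

```python
import xml.etree.ElementTree as ET


def add_fallback_rule(xml_text, fallback_fonts, ranges=None, first=False):
    """Return fallback-settings XML with one extra Rule element.

    ranges: optional string such as "4E00-9FFF"; omit it for a catch-all rule.
    first:  insert the rule at the top of the table instead of the bottom.
    """
    root = ET.fromstring(xml_text)
    # Locate the table by local name so the namespace does not matter.
    table = next(
        (el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "FallbackTable"),
        None,
    )
    if table is None:
        raise ValueError("no FallbackTable element found")

    namespace = table.tag[1:table.tag.index("}")] if table.tag.startswith("{") else ""
    attributes = {}
    if ranges:
        attributes["Ranges"] = ranges
    attributes["FallbackFonts"] = ", ".join(fallback_fonts)

    rule = ET.Element(f"{{{namespace}}}Rule" if namespace else "Rule", attributes)
    if first:
        table.insert(0, rule)
    else:
        table.append(rule)

    if namespace:
        # Keep the original default namespace instead of an ns0: prefix.
        ET.register_namespace("", namespace)
    return ET.tostring(root, encoding="unicode")


# Assumed usage with Aspose.Words (not executed here):
#   font_settings.fallback_settings.load_noto_fallback_settings()
#   font_settings.fallback_settings.save("fallback.xml")
#   ... rewrite fallback.xml with add_fallback_rule(...) ...
#   font_settings.fallback_settings.load("fallback.xml")
```

For example, a rule with `ranges="4E00-9FFF"` and `fallback_fonts=["SimSun", "Arial Unicode MS"]` would point the CJK Unified Ideographs block at fonts that are already in the project.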

The issues you found earlier (filed as WORDSNET-25601 and WORDSNET-25602) have been fixed in the Aspose.Words for .NET 23.7 update, which is also available on NuGet.