Converting PDF to DOCX - missing fonts

Hi,
I am having issues with converting PDF file (with embedded fonts) to DOCX using Aspose.PDF.
Code snippet showing how I am using it:

    private ByteArrayOutputStream convertPdfToDocx(byte[] pdfBytes) {
        Document pdfDocument = new Document(new ByteArrayInputStream(pdfBytes));

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        pdfDocument.save(byteArrayOutputStream, SaveFormat.DocX);

        return byteArrayOutputStream;
    }

And the result is a .docx file, but with ‘default’ fonts. I’d like it to have all the fonts that .pdf file had embedded. Attaching the input file (was not able to attach .docx document, as it’s not authorized):

4pages.pdf (291.0 KB)

Thanks in advance, any help would be appreciated.

@MaciejRocket

For the output to use the desired fonts, those fonts must be installed on the system so that the API can access them during generation. Fonts embedded in the PDF document cannot be used for this purpose. However, we have logged an investigation ticket, PDFJAVA-39909, in our issue tracking system to analyze your requirements further. We will check the feasibility of this feature and let you know as soon as the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

@asad.ali
Thanks for your response. I have a question then: if I have those fonts somewhere on the classpath, can I ‘force’ Aspose to use them as if they were installed system fonts? Maybe with something similar to the FontSubstitution mechanism?

@MaciejRocket

You can use the setLocalFontPaths() method to point the API at fonts stored outside the default system folder. Also, would you please share the sample output document that you generated on your side?

@asad.ali
Thanks.

The problem seems to be that the PDF document has these fonts embedded (checked using pdfDocument.getFontUtilities().getAllFonts()):

FONT NAME:Tinos-Regular
FONT NAME:Tinos-Bold
FONT NAME:ShadowsIntoLightTwo-Regular
FONT NAME:Lora-Regular
FONT NAME:JustMeAgainDownHere
FONT NAME:Lora-Regular

and I do have those in /usr/share/fonts (we use those fonts to substitute other fonts during PDF->PNG conversion, and that works).

But when I list the fonts in the converted DOCX document (using wordDocument.getFontInfos()), I see:

FONT INFO NAME: Times New Roman
FONT INFO NAME: Symbol
FONT INFO NAME: Arial
FONT INFO NAME: Calibri
FONT INFO NAME: Cambria Math
FONT INFO NAME: Tinos
FONT INFO NAME: IFFGTO+Tinos-Regular
FONT INFO NAME: UJNKSF+Lora Regular
FONT INFO NAME: Just Me Again Down Here
FONT INFO NAME: JRGWIJ+Shadows Into Light Two

Those names don’t match. Is it possible that this mismatch is the cause?

localFontPaths contains both:
/usr/local/share/fonts/
/usr/share/fonts/

I have attached the output document to this post: response-uytre.docx.zip (54.1 KB)
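The six-letter prefixes in the DOCX listing (IFFGTO+, UJNKSF+, JRGWIJ+) have the shape of PDF subset tags: the PDF specification names an embedded font subset with six uppercase letters followed by a plus sign. Below is a minimal sketch, in plain Java with no Aspose dependency, of how the two listings could be compared by stripping that tag and ignoring spaces, hyphens, and case. The class and method names are hypothetical, and nothing in this thread confirms that Aspose.PDF matches font names this way.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF: normalizes font names for comparison.
final class FontNameNormalizer {
    // A PDF subset tag is exactly six uppercase ASCII letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    private FontNameNormalizer() {}

    // Removes a leading subset tag, e.g. "IFFGTO+Tinos-Regular" becomes "Tinos-Regular".
    static String stripSubsetTag(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    // Strips the subset tag, drops whitespace and hyphens, and lowercases the rest.
    static String comparisonKey(String fontName) {
        return stripSubsetTag(fontName)
                .replaceAll("[\\s\\-]", "")
                .toLowerCase(Locale.ROOT);
    }

    // True when both names reduce to the same comparison key.
    static boolean sameFont(String a, String b) {
        return comparisonKey(a).equals(comparisonKey(b));
    }
}
```

With this normalization, "UJNKSF+Lora Regular" and "Lora-Regular" compare equal, while "JRGWIJ+Shadows Into Light Two" and "ShadowsIntoLightTwo-Regular" still differ because one name carries a style suffix and the other does not.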

@MaciejRocket

Thanks for sharing these details.

We will further check the provided information and share our feedback with you soon.

@MaciejRocket

We have investigated further and found that this is not a bug in the API. Word documents have default styles that reference these system fonts:

  • Times New Roman
  • Symbol
  • Arial
  • Calibri
  • Cambria Math

You could verify it by converting an empty document:

ByteArrayOutputStream output = new ByteArrayOutputStream(); // in-memory target for the DOCX
Document pdfDocument = new Document();
pdfDocument.getPages().add();
pdfDocument.save(output, SaveFormat.DocX);

These fonts are not normally used for text in the output document, but they may be used for the Aspose watermark in evaluation mode and in situations where some fonts are not found.

@asad.ali
Could you please walk me through the process of converting PDF to DOCX using Aspose? I have attached the PDF file:
inventoryChecklist.pdf (40.9 KB)

When I use your online converter (Convert Files Online - Word, PDF, HTML, JPG And Many More), the result opens correctly on my MacBook, with proper fonts, etc. However, when I use the code from my original post, the output .docx file has fonts missing and replaced by incorrect ones:

inventoryChecklist-missingfonts.docx.zip (56.4 KB)

You can clearly see that this file looks way different than the original PDF (at least on Mac’s default docx viewer). What am I doing wrong?

@MaciejRocket

The online app that you are using for conversion is built on Aspose.Words for .NET, which uses a different conversion engine for PDF to DOCX. We used the code snippet below with Aspose.PDF for Java 21.2 and obtained the attached DOCX file. Could you please open it on your Mac and let us know if you notice any issue?

Document doc = new Document(dataDir + "inventoryChecklist.pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
saveOption.setRecognizeBullets(true);
doc.save(dataDir + "Sample_21.2.docx", saveOption);

Sample_21.2.zip (62.4 KB)

@asad.ali

Thanks. Attached .docx file looks good, fonts match those from PDF. However, when I try to convert it using the exact snippet you have provided, it still looks different on my machine. Might this be happening because of some missing system fonts, or, in other words, is it possible that the same code would produce different results based on what fonts are installed on the machine?

Btw. I’ve tried this on both 21.1 (got the license) and 21.2 (without license).

@MaciejRocket

Yes, this is possible, as the API uses system fonts and chooses suitable fonts while producing the output document. This is why we recommend installing all Microsoft essential fonts on the system where the API is being used. Please try placing all MS Core fonts on your system and convert the document again. Feel free to let us know if the issue still persists.

@asad.ali

When I list all the fonts installed on the OS, using following piece of code:

    GraphicsEnvironment ge = GraphicsEnvironment.getLocalGraphicsEnvironment();
    String[] families = ge.getAvailableFontFamilyNames();
    for (String family : families) {
        System.out.println(family);
    }

I get the list:

**Cambria**
Caveat
Dancing Script
DejaVu Sans
DejaVu Sans Mono
DejaVu Serif
Dialog
DialogInput
Just Me Again Down Here
Liberation Mono
Liberation Sans
Liberation Serif
Lora
Monospaced
MS Gothic
Open Sans
SansSerif
Serif
Shadows Into Light Two
StandardSymL
**Times New Roman**
Tinos

As we can see, both Times New Roman and Cambria are there. However, when I open the converted DOCX on the Mac, those fonts appear to be missing:

Screen Shot 2021-03-04 at 09.53.09.png (120.4 KB)
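Rather than scanning the printed list by eye, the check can be automated. The sketch below compares a set of required font families (taken here from the families reported earlier in this thread) with what GraphicsEnvironment reports. The class and method names are hypothetical. Note that this only shows what the JVM's AWT layer sees; the thread does not confirm that Aspose.PDF resolves fonts through the same mechanism.

```java
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: reports which required font families the JVM cannot see.
final class FontAvailabilityCheck {
    private FontAvailabilityCheck() {}

    // Returns the required families absent from 'available' (case-insensitive), sorted by name.
    static List<String> missingFamilies(Collection<String> required, Collection<String> available) {
        Set<String> have = available.stream()
                .map(name -> name.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        return required.stream()
                .filter(name -> !have.contains(name.toLowerCase(Locale.ROOT)))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Families reported for the converted document earlier in this thread.
        List<String> required = Arrays.asList(
                "Tinos", "Lora", "Just Me Again Down Here", "Shadows Into Light Two");
        String[] families = GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        System.out.println("Missing: " + missingFamilies(required, Arrays.asList(families)));
    }
}
```

Running this inside the Docker container and on the Mac would show whether the two environments differ at the JVM level, which helps separate an installation problem from a font lookup problem inside the library.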

@MaciejRocket

Thanks for writing back.

Would you kindly confirm below details:

  • If you convert the PDF to DOCX in a Windows environment and open the resulting DOCX on the Mac, do you see the missing fonts issue in the file?
  • Or are you performing the conversion on the Mac and seeing the issue only in the file generated there? If so, did you try to open that file in a Windows environment?

@asad.ali

There are two environments I run this on:

  • my Mac machine, with the following fonts installed:

Aqua Kana

.Arabic UI Display Black

.ArabicUIText

.Helvetica Neue DeskInterface

.SF Compact Display

.SF Compact Rounded

.SF Compact Text

.SF NS Display Condensed

.SF NS Text

.SF NS Text Condensed

Al Bayan

Al Nile

Al Tarikh

American Typewriter

Andale Mono

Apple Braille

Apple Chancery

Apple Color Emoji

Apple LiGothic

Apple LiSung

Apple SD Gothic Neo

Apple Symbols

AppleGothic

AppleMyungjo

Arial

Arial Black

Arial Hebrew

Arial Hebrew Scholar

Arial Narrow

Arial Rounded MT Bold

Arial Unicode MS

Athelas

Avenir

Avenir Book

Avenir Next

Avenir Next Condensed

Ayuthaya

Baghdad

Bangla MN

Bangla Sangam MN

Baoli SC

Baoli TC

Baskerville

Beirut

BiauKai

Big Caslon

Bodoni 72

Bodoni 72 Oldstyle

Bodoni 72 Smallcaps

Bodoni Ornaments

Bradley Hand

Brush Script MT

Chalkboard

Chalkboard SE

Chalkduster

Charter

Cochin

Comic Sans MS

Copperplate

Corsiva Hebrew

Courier

Courier New

Damascus

DecoType Naskh

Devanagari MT

Devanagari Sangam MN

Dialog

DialogInput

Didot

DIN Alternate

DIN Condensed

Diwan Kufi

Diwan Thuluth

Euphemia UCAS

Farah

Farisi

Futura

GB18030 Bitmap

Geeza Pro

Geneva

Georgia

Gill Sans

Gujarati MT

Gujarati Sangam MN

GungSeo

Gurmukhi MN

Gurmukhi MT

Gurmukhi Sangam MN

Hannotate SC

Hannotate TC

HanziPen SC

HanziPen TC

HeadLineA

Hei

Heiti SC

Heiti TC

Helvetica

Helvetica Neue

Herculanum

Hiragino Kaku Gothic Pro

Hiragino Kaku Gothic ProN

Hiragino Kaku Gothic Std

Hiragino Kaku Gothic StdN

Hiragino Maru Gothic Pro

Hiragino Maru Gothic ProN

Hiragino Mincho Pro

Hiragino Mincho ProN

Hiragino Sans

Hiragino Sans CNS

Hiragino Sans GB

Hiragino Sans GB W3

Hiragino Sans W0

Hiragino Sans W1

Hiragino Sans W2

Hiragino Sans W3

Hiragino Sans W4

Hiragino Sans W5

Hiragino Sans W6

Hiragino Sans W7

Hiragino Sans W8

Hiragino Sans W9

Hoefler Text

Impact

InaiMathi

Iowan Old Style

ITF Devanagari

ITF Devanagari Marathi

Kai

Kailasa

Kaiti SC

Kaiti TC

Kannada MN

Kannada Sangam MN

Kefa

Khmer MN

Khmer Sangam MN

Klee

Kohinoor Bangla

Kohinoor Devanagari

Kohinoor Telugu

Kokonor

Krungthep

KufiStandardGK

Lantinghei SC

Lantinghei TC

Lao MN

Lao Sangam MN

Liberation Mono

Liberation Sans

Liberation Serif

Libian SC

Libian TC

LiHei Pro

LingWai SC

LingWai TC

LiSong Pro

Lucida Grande

Luminari

Malayalam MN

Malayalam Sangam MN

Marion

Marker Felt

Menlo

Microsoft Sans Serif

Mishafi

Mishafi Gold

Monaco

Monospaced

MS Gothic

Mshtakan

Muna

Myanmar MN

Myanmar Sangam MN

Nadeem

Nanum Brush Script

Nanum Gothic

Nanum Myeongjo

Nanum Pen Script

New Peninim MT

Noteworthy

Noto Nastaliq Urdu

Optima

Oriya MN

Oriya Sangam MN

Osaka

Palatino

Papyrus

PCMyungjo

Phosphate

PilGi

PingFang HK

PingFang SC

PingFang TC

Plantagenet Cherokee

PT Mono

PT Sans

PT Sans Caption

PT Sans Narrow

PT Serif

PT Serif Caption

Raanana

Rockwell

Sana

SansSerif

Sathu

Savoye LET

Seravek

Serif

Shree Devanagari 714

SignPainter

Silom

Sinhala MN

Sinhala Sangam MN

Skia

Snell Roundhand

Songti SC

Songti TC

StandardSymL

STFangsong

STHeiti

STIXGeneral

STIXIntegralsD

STIXIntegralsSm

STIXIntegralsUp

STIXIntegralsUpD

STIXIntegralsUpSm

STIXNonUnicode

STIXSizeFiveSym

STIXSizeFourSym

STIXSizeOneSym

STIXSizeThreeSym

STIXSizeTwoSym

STIXVariants

STKaiti

STSong

Sukhumvit Set

Superclarendon

Symbol

System Font

Tahoma

Tamil MN

Tamil Sangam MN

Telugu MN

Telugu Sangam MN

Thonburi

Times

Times New Roman

Toppan Bunkyu Gothic

Toppan Bunkyu Midashi Gothic

Toppan Bunkyu Midashi Mincho

Toppan Bunkyu Mincho

Trattatello

Trebuchet MS

Tsukushi A Round Gothic

Tsukushi B Round Gothic

Verdana

Waseem

Wawati SC

Wawati TC

Webdings

Weibei SC

Weibei TC

Wingdings

Wingdings 2

Wingdings 3

Xingkai SC

Xingkai TC

Yuanti SC

Yuanti TC

YuGothic

YuKyokasho

YuKyokasho Yoko

YuMincho

YuMincho +36p Kana

Yuppy SC

Yuppy TC

Zapf Dingbats

Zapfino

البيان

التاريخ

النيل

بغداد

بيروت

جيزة

دمشق

ديوان ثلث

ديوان كوفي

صنعاء

فارسي

فرح

منى

مِصحفي

مِصحفي ذهبي

نديم

نسخ

وسيم

in which everything works correctly.

  • ‘cloud’ env, which is basically a Docker container with an Ubuntu distro running underneath, with the fonts from my previous post. Here it does not work: it complains about missing fonts (even though they are there, as I can see in the response from getAvailableFontFamilyNames())

Update:

When I copied all the fonts from my local machine (/System/Library/Fonts) to the Docker image (/usr/share/fonts), it still did not work.

@MaciejRocket

These fonts should be placed in the “/usr/share/fonts/truetype/msttcorefonts” directory, as Aspose.PDF scans this folder on Linux-like operating systems. Furthermore, as shared earlier, you can use the setLocalFontPaths() method to set the path to the fonts so that the API can find them during conversion. To check where the API is searching for fonts, you can use the getLocalFontPaths() method as well. Please let us know if this information does not help in resolving your issue. We will then proceed to assist you accordingly.
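Before changing any paths, it can help to confirm that the font files are actually present and readable in the folders mentioned above from inside the running process. The following is a minimal sketch in plain Java, with no Aspose calls; the class and method names are hypothetical, and the extension filter (.ttf, .otf, .ttc) is an assumption about which files matter.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists font files found under a directory tree.
final class FontFolderScan {
    private FontFolderScan() {}

    // Assumption: only these extensions are treated as font files.
    static boolean isFontFile(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf") || name.endsWith(".ttc");
    }

    // Recursively collects readable font files under 'root'; empty if 'root' is not a directory.
    static List<Path> listFontFiles(Path root) throws IOException {
        if (!Files.isDirectory(root)) {
            return new ArrayList<>();
        }
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    .filter(FontFolderScan::isFontFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Folders named in this thread.
        String[] folders = {
            "/usr/share/fonts",
            "/usr/local/share/fonts",
            "/usr/share/fonts/truetype/msttcorefonts"
        };
        for (String folder : folders) {
            List<Path> fonts = listFontFiles(Paths.get(folder));
            System.out.println(folder + ": " + fonts.size() + " font file(s)");
        }
    }
}
```

A count of zero for a folder that should contain fonts points to an image build or file permission problem rather than to the conversion code.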

@asad.ali

When I use getLocalFontPaths() the result I get is:

/System/Library/Fonts on the Mac

and:

/usr/share/fonts and /usr/local/share/fonts in the Docker container

And all the fonts are copied there during Docker image build.

I have also tried copying all of them into “/usr/share/fonts/truetype/msttcorefonts” and setting this folder using setLocalFontPaths(), but it didn’t solve the problem.

@MaciejRocket

Thanks for sharing the further details.

We will further investigate the reason behind the issue that you are facing. Would you kindly share a sample Dockerfile with us in .zip format so that we can set up a similar environment to investigate the issue? We will log another ticket in our issue tracking system and share the ID with you.

@asad.ali

Thanks.

My setup is as follows:

macOS version: High Sierra 10.13.6

Dockerfile bits:

FROM adoptopenjdk:openj9-focal

in which we copy all the fonts:

COPY ./src/main/resources/aspose-fonts /usr/share/fonts

where “./src/main/resources/aspose-fonts” is the directory in which we store all our fonts within the app.

then we rebuild fonts cache:

RUN fc-cache -f

The code we execute to convert pdf is:

    Document document = new Document(new ByteArrayInputStream(pdfBytes));
    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setRecognizeBullets(true);
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    document.save(byteArrayOutputStream, saveOption);
    return byteArrayOutputStream;

And the document is the one attached at the beginning of our conversation. Hope that helps!
Let me know if you want more details.

@MaciejRocket

We tested the scenario in our environment (Linux Ubuntu and CentOS) but we could not replicate the same issue that you are facing. However, we have logged an investigation ticket as PDFJAVA-40251 in our issue tracking system for the sake of further analysis. We will look into details of it and keep you posted with the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.