Convert PDF to Docx platform issues

Loupi · June 28, 2023, 3:49pm

Hello,

Using aspose-pdf 23.5 for java.

I’m having platform-specific issues (Linux, macOS) while converting a PDF to DOCX.
On Linux, some text is missing, and the fonts are not the same.
On macOS, all text is empty.
On Windows, everything is perfect.

I think it might be a fonts issue, but I cannot find anything about this in the documentation.
Of course, I’d really like a solution here, as my production code runs on linux, and we also have developers running it on macos.

Find attached the input pdf, and output docx files for windows, macos and linux.

Here is the Java source code that I use to perform the conversion.

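import com.aspose.pdf.DocSaveOptions;
import com.aspose.pdf.Document;
import org.apache.commons.io.FileUtils;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;

// Load the whole input PDF into memory and expose it as a stream.
try (final InputStream sourceStream = new ByteArrayInputStream(FileUtils.readFileToByteArray(new File("./in.pdf")))) {
  // Configure the PDF to DOCX conversion.
  DocSaveOptions saveOptions = new DocSaveOptions();
  saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
  saveOptions.setRecognizeBullets(true);
  saveOptions.setAddReturnToLineEnd(true);

  try (Document pdfDocument = new Document(sourceStream)) {
    try (final ByteArrayOutputStream targetStream = new ByteArrayOutputStream()) {
      // Convert into an in-memory buffer, then write the buffer to disk.
      pdfDocument.save(targetStream, saveOptions);
      targetStream.flush();

      final byte[] temp = targetStream.toByteArray();
      File f = new File("./out.docx");
      Files.write(f.toPath(), temp);
    }
  }
}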

in.pdf (30.4 KB)
out.windows.docx (57.9 KB)
out.linux.docx (71.3 KB)
out.mac.docx (47.7 KB)

Regards

sergei.shibanov · June 28, 2023, 6:21pm

@Loupi
Yes, the library relies heavily on the fonts that ship with MS Windows, so on other operating systems it often helps to install them.
Another point to check when using the library on Linux:
In which folder do you have the fonts installed?
The library looks for fonts in these folders:
“/usr/share/fonts”
“/usr/share/fonts/truetype/msttcorefonts”
“/usr/share/fonts/msttcore”
“/usr/local/share/fonts”
“~/.fonts”
It does not take the font cache into account. For example, a font being listed by the command “fc-list | grep ".ttf" | cut -f2 -d: | sort | uniq” does not mean that the library will use it.
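
To see what is actually present in those folders, independently of the fontconfig cache, a small check like the following may help. It is only a sketch that uses the plain JDK and no Aspose API. The class and method names (FontFolderCheck, candidateFolders, listTtfFiles) are made up for illustration, and searching subfolders recursively is an assumption, since the thread does not say how deep the library looks.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the .ttf files found in the folders named above.
class FontFolderCheck {

    // The five folders from the post; "~" is expanded to the user's home directory.
    static List<Path> candidateFolders() {
        List<Path> folders = new ArrayList<>();
        folders.add(Paths.get("/usr/share/fonts"));
        folders.add(Paths.get("/usr/share/fonts/truetype/msttcorefonts"));
        folders.add(Paths.get("/usr/share/fonts/msttcore"));
        folders.add(Paths.get("/usr/local/share/fonts"));
        folders.add(Paths.get(System.getProperty("user.home"), ".fonts"));
        return folders;
    }

    // Collects *.ttf files under a folder (recursively, which is an assumption).
    // A folder that does not exist yields an empty list.
    static List<Path> listTtfFiles(Path folder) throws IOException {
        if (!Files.isDirectory(folder)) {
            return new ArrayList<>();
        }
        try (Stream<Path> walk = Files.walk(folder)) {
            return walk
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString()
                              .toLowerCase(Locale.ROOT).endsWith(".ttf"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        for (Path folder : candidateFolders()) {
            List<Path> fonts = listTtfFiles(folder);
            System.out.println(folder + ": " + fonts.size() + " .ttf file(s)");
            for (Path font : fonts) {
                System.out.println("  " + font);
            }
        }
    }
}
```

If every folder reports zero files while fc-list still shows fonts, the fonts are probably installed somewhere outside the folders listed above.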

Loupi · June 28, 2023, 6:55pm

@sergei.shibanov

Thank you for the reply. I’m going to install the fonts packages and make sure that they are in the folders you mentioned.

I’ll post an update soon.

sergei.shibanov · June 29, 2023, 3:31am

@Loupi
Yes, please post the results.

Loupi · June 29, 2023, 3:40pm

@sergei.shibanov

I managed to get it working on both Debian and CentOS by installing the ttf-mscorefonts-installer package.
I’m still looking into how to do it on OSX.

sergei.shibanov · June 29, 2023, 6:03pm

@Loupi
Thanks for posting the results.
For OSX, unfortunately, I cannot tell you yet. I have asked the development team for advice; maybe they can suggest something.

sergei.shibanov · June 30, 2023, 2:19pm

@Loupi
For macOS, you should take the fonts from your Windows system and put them on the Mac.

On macOS, copy the fonts to both /opt/local/share/fonts and /Library/Fonts, then run ‘fc-cache -fv && sudo fc-cache -fv’.

According to my colleagues, two fonts, “arial.ttf” and “times.ttf”, are usually enough.
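
After copying, a quick way to confirm that both files landed in both folders is a check along these lines. This is a rough sketch using only the JDK. The class name MacFontCheck and the method missingFonts are hypothetical, and the case-insensitive file name comparison is an assumption rather than documented library behaviour.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: reports which of the suggested fonts are missing.
class MacFontCheck {

    // The two font files mentioned in the post.
    static final List<String> REQUIRED = Arrays.asList("arial.ttf", "times.ttf");

    // Returns the required file names that are not directly inside the folder.
    // Names are compared case-insensitively (an assumption).
    static List<String> missingFonts(Path folder, List<String> required) throws IOException {
        Set<String> present = new HashSet<>();
        if (Files.isDirectory(folder)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
                for (Path entry : entries) {
                    present.add(entry.getFileName().toString().toLowerCase(Locale.ROOT));
                }
            }
        }
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!present.contains(name.toLowerCase(Locale.ROOT))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // The two target folders named in the post.
        List<Path> folders = Arrays.asList(
            Paths.get("/opt/local/share/fonts"),
            Paths.get("/Library/Fonts"));
        for (Path folder : folders) {
            System.out.println(folder + " is missing: " + missingFonts(folder, REQUIRED));
        }
    }
}
```

An empty list for both folders means the two files are in place; it does not by itself prove that the conversion will pick them up.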