I’m seeing platform-specific issues (Linux, macOS) when converting a PDF to DOCX.
On Linux, some text is missing and the fonts differ from the original.
On macOS, all text is empty.
On Windows, everything is perfect.
I think it might be a fonts issue, but I cannot find anything about this in the documentation.
Of course, I’d really like a solution here, as my production code runs on Linux, and we also have developers running it on macOS.
Attached are the input PDF and the output DOCX files for Windows, macOS and Linux.
Here is the Java source code that I use to perform the conversion.
// Read the PDF into memory and expose it as a stream.
try (final InputStream sourceStream = new ByteArrayInputStream(FileUtils.readFileToByteArray(new File("./in.pdf")))) {
    DocSaveOptions saveOptions = new DocSaveOptions();
    saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOptions.setRecognizeBullets(true);
    saveOptions.setAddReturnToLineEnd(true);
    try (Document pdfDocument = new Document(sourceStream)) {
        try (final ByteArrayOutputStream targetStream = new ByteArrayOutputStream()) {
            // Convert to DOCX in memory, then write the bytes to disk.
            pdfDocument.save(targetStream, saveOptions);
            targetStream.flush();
            final byte[] temp = targetStream.toByteArray();
            File f = new File("./out.docx");
            Files.write(f.toPath(), temp);
        }
    }
}
@Loupi
Yes, the library relies heavily on the fonts available in MS Windows, so on other operating systems it often helps to install those fonts.
One more thing to check on Linux: in which folder are your fonts installed?
The library looks for fonts in the following folders:
"/usr/share/fonts"
"/usr/share/fonts/truetype/msttcorefonts"
"/usr/share/fonts/msttcore"
"/usr/local/share/fonts"
"~/.fonts"
It does not take the font cache into account. The fact that a font is listed by the command 'fc-list | grep ".ttf" | cut -f2 -d: | sort | uniq' therefore does not mean that the library will use it.
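To see what is actually present in the folders listed above, a small diagnostic along the following lines may help. It is only a sketch using the plain JDK: the class name FontFolderCheck and its methods are hypothetical and not part of the library, and it assumes that only ".ttf" files matter and that subfolders should be scanned too, neither of which is confirmed in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper for diagnosis only; it does not call the PDF library.
class FontFolderCheck {

    // The folders named in the reply above. "~" is expanded by hand,
    // because Java does not expand it automatically.
    static List<Path> defaultFolders() {
        String home = System.getProperty("user.home");
        return Arrays.asList(
                Paths.get("/usr/share/fonts"),
                Paths.get("/usr/share/fonts/truetype/msttcorefonts"),
                Paths.get("/usr/share/fonts/msttcore"),
                Paths.get("/usr/local/share/fonts"),
                Paths.get(home, ".fonts"));
    }

    // Returns every .ttf file found under the given folders (recursively).
    // Missing folders are skipped; a file reachable through two listed
    // folders is reported once.
    static List<Path> findFontFiles(List<Path> folders) throws IOException {
        Set<Path> found = new TreeSet<>();
        for (Path folder : folders) {
            if (!Files.isDirectory(folder)) {
                continue;
            }
            try (Stream<Path> files = Files.walk(folder)) {
                files.filter(Files::isRegularFile)
                     .filter(p -> p.getFileName().toString()
                                   .toLowerCase(Locale.ROOT).endsWith(".ttf"))
                     .forEach(found::add);
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) throws IOException {
        // Show which of the folders exist on this machine.
        for (Path folder : defaultFolders()) {
            System.out.println(folder + (Files.isDirectory(folder) ? "" : " (missing)"));
        }
        // Then list the font files found in them.
        for (Path font : findFontFiles(defaultFolders())) {
            System.out.println("  " + font);
        }
    }
}
```

If a font used by the PDF shows up in fc-list but not in this output, it is probably installed somewhere outside the folders above, which would match the behaviour described.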
@Loupi
Thanks for posting the results.
For macOS, unfortunately, I cannot tell you anything yet. I have asked the development team for advice; perhaps they can suggest something.