PDF to DOCX - Some Text becomes Gibberish

TL;DR: DOCX text turns into gibberish when opened on a European Mac system, but not on American Mac or Windows systems…

I am working on an implementation of Aspose PDF that transforms a PDF into a DOCX Word Document. When I open the document on my machine (Windows 10 - USA), it looks good, but when my colleague opens it on his machine (MacOS - Europe) the majority of the text is unreadable. It seems that the text renders this way only when the Font Size equals 12. This was validated by looping through each ‘TextSegment’ and adjusting the Font Size when it equaled 12 (or 11.939… to be exact). This worked but resulted in spacing and other formatting inconsistencies. Is this a known issue and is there any advice you could provide in resolving this?

Details on Aspose PDF Implementation:

  • Input is a PDF File
  • Output is DOCX
  • Runs as an AWS Lambda function (Java 8 Runtime)
  • Recognition Mode is set to Flow

@jmcginnis

The issue may also be related to Microsoft fonts missing from the system. Please make sure all essential Microsoft fonts are installed on the Mac system and check whether the issue still persists. Also, please share your sample PDF document along with screenshots of the issue so that we can assist you further.

bugged text.jpg (608.1 KB)
test.pdf (376.4 KB)

Thanks for the reply. Attached are the PDF used as input and a screenshot of the converted DOCX file as seen on my colleague’s Mac. Keep in mind that, were I to set the font size to 11 or so while keeping the font type as Arial MT, the text would render correctly, albeit with spacing issues.

@jmcginnis

We really apologize for the inconvenience. Could you please also share the problematic output Word file with us in .zip format?

Sure thing. FYI we are using v20.7 of Aspose.PDF Java.

docxExport.zip (241.0 KB)

@jmcginnis

This seems to be a system/environment-specific issue: we opened the shared .docx file on Windows and it rendered correctly. Please check the attached screenshot. The API is generating the .docx file correctly; the Word file is not rendered correctly on macOS because of missing fonts. Please try installing the Windows fonts on that system, or open the file on another machine, to check whether the issue is related to document generation via the API or to the specific environment.
Docinmsword.png (50.4 KB)

Alright, thank you for your time. It’s helpful to know that this isn’t an issue with the actual generation process.

I was able to solve this issue. Manually editing the document.xml file (after unzipping the docx archive) to change the font’s ascii value from “LOFFWC+Arial” (I don’t think that 6-letter prefix is predictable?) to just “Arial” solved the issue.

Is there a way I can accomplish this with the Aspose PDF Java sdk? I can automate it by unzipping, editing the XML, and zipping back up but if it can be done with Aspose, that’d be preferable. Thank you!
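
The unzip, edit, re-zip workaround described above can be sketched with only the JDK's java.util.zip classes, which keeps it usable on a Java 8 runtime. This is a rough sketch under stated assumptions, not part of any Aspose API: the class name DocxFontFixer and its methods are hypothetical, word/document.xml is assumed to be UTF-8, and the prefix is assumed to be a PDF subset-font tag. Such tags are always six uppercase letters followed by "+", so although the letters themselves vary, the pattern can be matched without knowing them in advance.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: strips subset tags such as "LOFFWC+" from font names
// in word/document.xml and copies every other DOCX part through unchanged.
final class DocxFontFixer {

    // Six uppercase letters plus '+', only when directly after the opening
    // quote of an attribute value, e.g. w:ascii="LOFFWC+Arial".
    private static final Pattern SUBSET_TAG = Pattern.compile("(?<=\")[A-Z]{6}\\+");

    static String stripSubsetPrefix(String xml) {
        return SUBSET_TAG.matcher(xml).replaceAll("");
    }

    // Reads a DOCX from 'in' and writes the fixed DOCX to 'out'.
    // Both streams are closed when this method returns.
    static void fix(InputStream in, OutputStream out) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in);
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = readAll(zin);
                // Assumption: only document.xml needs the fix, as in the manual edit.
                if ("word/document.xml".equals(entry.getName())) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = stripSubsetPrefix(xml).getBytes(StandardCharsets.UTF_8);
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        }
    }

    // Java 8 has no InputStream.readAllBytes(), so drain the current entry manually.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

Because fix works on streams, it can run on the converted document in memory inside the Lambda function without touching the file system. If prefixed names also appear in other parts such as word/styles.xml, the entry-name check would need to be widened; that has not been verified here.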

@jmcginnis

We are afraid that Aspose.PDF does not offer any way to process or modify Word files. You can use Aspose.Words for this purpose, as it is specialized for working with Word documents. Please create a post in the respective category about changing the font name in a Word document, and you will be assisted there accordingly.