PDF to DOCX: the text content is garbled

When I use Aspose.PDF to convert a PDF to DOC, the text in the output is garbled.

Document pdfDocument = new Document(_dataDir + "PDFToDOC.pdf");
// Save the file in MS Word document format
pdfDocument.save(_dataDir + "PDFToDOC_out.doc", SaveFormat.Doc);

涨乐财富通密码重置业务操作指引(1).pdf (7.8 MB)

@yjsdfsdf

Please check the attached output DOCX, which was generated in our environment using version 23.7 of the API with a valid license. Below is the code snippet that we used:

Document doc = new Document(dataDir + "涨乐财富通密码重置业务操作指引(1).pdf");
DocSaveOptions saveOption = new DocSaveOptions();
saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
saveOption.setAddReturnToLineEnd(false);

saveOption.setCloseResponse(false);
doc.save(dataDir + "涨乐财富通密码重置业务操作指引(1).docx", saveOption);

涨乐财富通密码重置业务操作指引(1).docx (386.9 KB)

We did not notice any garbled text in the output. However, there were some formatting issues in the table at the end of the document. Can you please try the latest version and let us know if you still notice any issues?

I tested it on a Windows computer and the text is still garbled. Could it be that I don’t have the font on my computer?

My license doesn’t support version 23.7, so I didn’t use it. You can try it without a license; that shouldn’t cause this garbling.

I downloaded the DOCX file you uploaded, and the text in it is obviously wrong and garbled. Isn’t that what you are seeing? I have taken a screenshot of it for you.
Dingtalk_20230803135510.jpg (52.7 KB)

I tried extracting the text this way and could not get any content at all.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("ExtractBoldText.pdf");
com.aspose.pdf.TextFragmentAbsorber textAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
// Collect the text fragments from every page
pdfDocument.getPages().accept(textAbsorber);
for (com.aspose.pdf.TextFragment textFragment : textAbsorber.getTextFragments())
{
    System.out.println(textFragment.getText());
}
I don’t know what’s so special about this document. Help me.

@yjsdfsdf

The issue looks related to missing fonts. It seems you do not have the Windows fonts installed on your system. Please install all fonts that support the characters of this language. If the issue still persists, please let us know.
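
One quick way to see whether a font is installed is to ask the JVM which font families it can find. The sketch below uses only the standard `java.awt.GraphicsEnvironment` API. The class name `FontCheck` and the helper `isAvailable` are hypothetical, and `FangSong` is just a default family name to look for. This only shows what the JVM sees; Aspose.PDF and MS Word may resolve fonts differently, so treat the result as a hint rather than proof.

```java
import java.awt.GraphicsEnvironment;
import java.util.Locale;

// Hypothetical helper, not part of any Aspose API.
final class FontCheck {

    // Returns true if `family` equals one of the `available` names, ignoring case.
    static boolean isAvailable(String family, String[] available) {
        String wanted = family.toLowerCase(Locale.ROOT);
        for (String name : available) {
            if (name.toLowerCase(Locale.ROOT).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Ask for English family names so the result does not depend on
        // the system locale (fonts may otherwise be listed under localized names).
        String[] installed = GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames(Locale.ENGLISH);

        // Font family to look for; pass another name as the first argument.
        String family = args.length > 0 ? args[0] : "FangSong";
        System.out.println(family + " visible to the JVM: " + isAvailable(family, installed));
    }
}
```

Running it on the machine that shows the garbled text, and again on a machine where the document looks fine, can help tell a missing font apart from other causes.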

I installed the SF PRO TEXT and DINPRo-Medium fonts, but still no luck. Can you please confirm that you have the relevant fonts on your computer?

I looked at the output document you uploaded, and it seems to have the same problem. Maybe you are not getting normal output either. Can you open your DOCX document and send me a screenshot?

@yjsdfsdf

Please check the screenshot of the file in our environment. image.png (139.3 KB)

Can you also try installing the FangSong font? It looks like the document requires this font to render the characters correctly.

I put the TTF file in the fonts directory, but it still doesn’t work. The screenshot you posted looks normal. I don’t know what to do now. Are there any other ideas?
fangsong.zip (5.7 MB)

I opened the XML of this document and found that it references a font named RHMBTW+FangSong. What kind of font is this, and how do I install it?
font.jpg (227.4 KB)
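
For context on the name above: in PDF files, a font name that starts with six uppercase letters and a plus sign normally marks an embedded subset of a font, and the part after the plus sign is the base font name. Under that convention, RHMBTW+FangSong would be a subset of FangSong rather than a separate font to install. The sketch below strips such a prefix with standard Java only; the class name `SubsetFontName` and the method `baseName` are hypothetical and not part of any Aspose API.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any Aspose API.
final class SubsetFontName {

    // A subset tag is exactly six uppercase ASCII letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Returns the font name without a leading subset tag, if one is present.
    static String baseName(String pdfFontName) {
        return SUBSET_TAG.matcher(pdfFontName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(baseName("RHMBTW+FangSong")); // prints "FangSong"
    }
}
```

Names without a six-letter tag are returned unchanged, so the helper can be applied to every font name found in a document.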

This font may ship with the operating system. Is your computer’s system language Japanese? Maybe this is a Japanese font.

I think I know what is going on: I am opening the file with Office 2016. Newer versions of Office seem to display it fine, with no garbling. Thank you.

@yjsdfsdf

Are you saying that your issue is resolved when you open the document with a newer version of MS Word? Can you please confirm?

Yes. My coworkers open the document with MS Office 2021 and it displays correctly.

@yjsdfsdf

It is nice to know that your issue has been resolved. Please keep using our API, and feel free to create a new topic if you need further assistance.