Aspose.Words: converting PDF to Markdown produces garbled characters

I converted this PDF to Markdown, and the .md output contains a lot of mojibake (nonsense characters displayed when software fails to render text according to its intended character encoding). I have already installed a lot of fonts.
Code:

    import aspose.words as aw

    document = aw.Document(file_path)  # file_path: path to the source PDF
    document.save("output2/test2.md")

政府采购非招标采购方式程序解读(一)如何成立谈判小组和询价小组.pdf (604.3 KB)

@David_Matin

Cause

The issue of mojibake (nonsense characters) appearing in the Markdown output when converting a PDF using Aspose.Words is often related to font encoding problems. This can occur if the PDF contains fonts that are not properly embedded or if the text is encoded in a way that doesn’t translate well during the conversion process.
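
To make the term concrete, here is a minimal, library-independent Python sketch of how mojibake arises in general: bytes written in one encoding are decoded with another. It is only an illustration and does not claim to reproduce what Aspose.Words does internally, which is not known from this thread. The sample string is taken from the attachment's file name.

```python
# Illustration only: mojibake appears when bytes are decoded with the wrong encoding.
original = "政府采购"  # sample text from the attachment's file name

utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises an error, but yields nonsense characters.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible only while no bytes have been lost or replaced.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(recovered == original)
```

In a PDF the mismatch usually sits one level lower, in the mapping from the font's character codes to Unicode, so a round trip like the one above may not be possible on the converted output.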

Solution

To address this issue, you can try the following steps:

  1. Ensure Fonts are Embedded: Check if the fonts used in the PDF are embedded. If they are not, consider using a PDF that has all necessary fonts embedded.

  2. Check the Output Encoding: Markdown is saved as UTF-8 by default, and MarkdownSaveOptions exposes an encoding setting inherited from the text save options. Changing the output encoding will not repair text that was already decoded incorrectly from the PDF, so ensuring that the source PDF is correctly encoded matters more.

  3. Update Aspose.Words: Make sure you are using the latest version of Aspose.Words, as updates often include bug fixes and improvements related to file conversions.

  4. Test with Different PDFs: If possible, test the conversion with different PDF files to see if the issue is specific to the PDF you are currently using.
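
As a rough way to carry out the font check in step 1 without extra tools, the sketch below counts a few font-related keys in the raw bytes of a PDF. The key names (`/FontFile`, `/FontFile2`, `/FontFile3`, `/ToUnicode`) come from the PDF specification; the function name `scan_pdf_fonts` and the sample bytes (including the font name `SimSun`) are made up for illustration. This is a heuristic, not a PDF parser: files that store their dictionaries in compressed object streams can report zero even when the fonts are embedded, so treat a zero count as inconclusive.

```python
import re

# Keys defined by the PDF specification:
#   /FontFile, /FontFile2, /FontFile3 -> an embedded font program
#   /ToUnicode                        -> a map from character codes to Unicode
EMBEDDED_FONT = re.compile(rb"/FontFile[23]?\b")
TO_UNICODE = re.compile(rb"/ToUnicode\b")
FONT_DICT = re.compile(rb"/Type\s*/Font\b")


def scan_pdf_fonts(data: bytes) -> dict:
    """Count font-related keys in raw PDF bytes (heuristic, not a parser)."""
    return {
        "font_dicts": len(FONT_DICT.findall(data)),
        "embedded_font_programs": len(EMBEDDED_FONT.findall(data)),
        "to_unicode_maps": len(TO_UNICODE.findall(data)),
    }


# Hypothetical sample: one font with a ToUnicode map and an embedded font program.
sample = (
    b"<< /Type /Font /Subtype /Type0 /BaseFont /SimSun /ToUnicode 9 0 R >>\n"
    b"<< /Type /FontDescriptor /FontName /SimSun /FontFile2 12 0 R >>"
)
print(scan_pdf_fonts(sample))
```

For a real file, pass `open(path, "rb").read()` instead of `sample`. Fonts without a `/ToUnicode` map are a common reason for text extraction to produce wrong characters even when the page looks correct in a viewer.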

Code

Here’s a basic Java snippet for converting a PDF to Markdown (the Java equivalent of the Python code in the question):

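    import com.aspose.words.Document;

    public class PdfToMarkdown {
        public static void main(String[] args) {
            try {
                // Load the PDF; Aspose.Words detects the format from the file content.
                Document document = new Document("path/to/your/file.pdf");
                // The .md extension selects Markdown as the output format.
                document.save("output/test.md");
            } catch (Exception e) {
                // Loading and saving both declare a checked Exception.
                e.printStackTrace();
            }
        }
    }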

Next Steps

  1. Verify the PDF file for embedded fonts.
  2. Update to the latest version of Aspose.Words.
  3. Test with different PDF files to isolate the issue.
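
When comparing several PDFs, it can help to put a number on how garbled each Markdown output is. The sketch below is one possible heuristic, not an Aspose.Words feature: it reports the share of non-whitespace characters that are the replacement character U+FFFD, private-use code points, or unassigned code points. The function name `suspicious_ratio` is hypothetical, and the check will miss mojibake that consists of valid but wrong characters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Share of non-whitespace characters that are U+FFFD, private-use, or unassigned."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn")
    )
    return bad / len(chars)


print(suspicious_ratio("政府采购 guide"))    # clean text
print(suspicious_ratio("\ufffd\ue001ab"))  # 2 of the 4 characters are suspicious
```

To apply it to a converted file, read the output with `open(path, encoding="utf-8").read()`, assuming the Markdown was saved as UTF-8, and compare the ratios across the PDFs you test.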

Please implement these steps and let me know if they resolve the problem. If you continue to experience issues, feel free to share more details about the PDF or any specific error messages you encounter, and I will assist you further.

@David_Matin
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-28478

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.