Convert to Markdown

pmrc · July 9, 2024, 12:30am

Hi everyone,

I need some help converting DOC, DOCX, and PDF documents to Markdown files. Currently, structural information such as titles is extracted as plain text. Is it possible to preserve headings during conversion? I checked the MarkdownSaveOptions (MarkdownSaveOptions class | Aspose.Words for Python) but couldn’t find a relevant option. Below is the code I am using for extraction. Please let me know if there is anything I can change to preserve the headings.

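import os
import tempfile

import aspose.words as aw

pdf_path = "./data/test.pdf"

def aspose_filter(file_path: str) -> str:
    # Convert the document inside a temporary directory and return the Markdown text.
    with tempfile.TemporaryDirectory() as temp_dir:
        pdf = aw.Document(file_path)
        save_path = os.path.join(temp_dir, "pdf.md")
        # The ".md" extension selects the Markdown save format.
        pdf.save(save_path)

        # Assumes UTF-8 output; "utf-8-sig" also strips a byte-order mark if present.
        with open(save_path, "r", encoding="utf-8-sig") as f:
            text = f.read()

    return text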

vadim.saltykov · July 9, 2024, 7:11am

@pmrc
Could you please zip and upload your input PDF document here so that we can reproduce the issue? We will investigate and provide you with more information.

pmrc · July 12, 2024, 4:21am

@vadim.saltykov

pdf_samples.zip (2.1 MB)

Among the attached documents, my case involves ko_1.pdf and ko_2.pdf, but I have also attached en_1.pdf since handling Korean documents might be inconvenient. The issue of not retaining heading information is the same for all three documents. Thank you for your help.

vadim.saltykov · July 12, 2024, 6:37am

@pmrc
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27185

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
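
Until the ticket is resolved, one possible stopgap is to infer headings from font size, since PDF files usually carry no heading styles. The sketch below is only an illustration of that heuristic, not Aspose.Words functionality: `paragraphs_to_markdown` is a hypothetical helper, and it assumes you have already collected each paragraph as a `(text, font_size)` pair by some other means. It treats the most common font size as body text and maps each larger size to a heading level.

```python
from collections import Counter


def paragraphs_to_markdown(paragraphs, max_level=3):
    """Hypothetical helper: paragraphs is a list of (text, font_size) tuples."""
    sizes = [size for text, size in paragraphs if text.strip()]
    if not sizes:
        return ""

    # Assumption: the most frequent font size is the body text size.
    body_size = Counter(sizes).most_common(1)[0][0]

    # Every distinct size larger than the body size becomes a heading level,
    # largest first, capped at max_level.
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_for = {s: min(i + 1, max_level) for i, s in enumerate(heading_sizes)}

    lines = []
    for text, size in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        level = level_for.get(size)
        lines.append(f"{'#' * level} {text}" if level else text)

    # Separate blocks with a blank line, as Markdown expects.
    return "\n\n".join(lines)


sample = [
    ("Title", 20.0),
    ("Intro text", 11.0),
    ("Section", 14.0),
    ("Body a", 11.0),
    ("Body b", 11.0),
]
markdown = paragraphs_to_markdown(sample)
```

A size-based heuristic like this will misclassify large body text (pull quotes, captions) as headings, so it may need a tolerance or a bold-font check for real documents.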