Hi everyone,
I need help converting DOC, DOCX, and PDF documents to Markdown files. Currently, structural elements such as titles are extracted as plain text. Is it possible to retain the heading layout during conversion? I checked the MarkdownSaveOptions documentation (MarkdownSaveOptions class | Aspose.Words for Python) but couldn’t find a relevant option. Below is the code I am using for extraction. Please let me know if there are any changes I can make to preserve the heading layout.
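import os
import tempfile

import aspose.words as aw

pdf_path = "./data/test.pdf"

def aspose_filter(file_path: str) -> str:
    with tempfile.TemporaryDirectory() as temp_dir:
        # Load the input document (DOC, DOCX, or PDF).
        pdf = aw.Document(file_path)
        save_path = os.path.join(temp_dir, "pdf.md")
        # The .md extension makes save() write Markdown.
        pdf.save(save_path)
        with open(save_path, "r", encoding="utf-8") as f:
            text = f.read()
    return text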
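
To make the problem easier to reproduce, one way to check whether any headings survived the conversion is to scan the returned Markdown for ATX-style heading lines (`#` through `######`). This is only a minimal sketch: `find_headings` is a hypothetical helper, not part of Aspose.Words, and it assumes the output uses ATX headings and backtick code fences rather than Setext underlines.

```python
import re

# Hypothetical helper (not an Aspose.Words API): lists ATX headings in Markdown text.
# Matches up to three leading spaces, 1-6 '#' characters, whitespace, then the title,
# with an optional closing run of '#' characters.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})[ \t]+(.*?)(?:[ \t]+#+)?[ \t]*$")


def find_headings(markdown_text: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for each ATX heading outside fenced code."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        # Toggle on backtick fences so '#' comments inside code are ignored.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = ATX_HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2).strip()))
    return headings
```

Calling `find_headings(aspose_filter(pdf_path))` and getting an empty list would confirm that the titles are being written as plain paragraphs rather than as headings.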