Aspose.Words: PDF to Markdown conversion issue

I have a PDF document of about 300 pages. I split the PDF by page and then convert each page to Markdown. After a few dozen pages the conversion becomes very slow: a single page can take two minutes. It also uses a considerable amount of memory, about 1.5 GB.

@David_Matin

It seems that you are experiencing performance issues when converting a large PDF document to Markdown using Aspose.Words. This can happen due to the size of the document and the complexity of its content, which may lead to increased memory usage and slower processing times.

Here are a few suggestions to help improve the performance of your conversion process:

  1. Batch Processing: Instead of converting all pages at once, consider processing the PDF in smaller batches. This can help manage memory usage and may improve overall speed.

  2. Optimize PDF Content: If possible, try to optimize the PDF content before conversion. Reducing the complexity of the PDF (e.g., simplifying images or removing unnecessary elements) can lead to faster conversion times.

  3. Use Save Options: When converting documents, you can utilize the MarkdownSaveOptions class to specify options that might help streamline the conversion process. This includes controlling how images are handled and adjusting other settings that could impact performance.

  4. Monitor Resource Usage: Keep an eye on memory and CPU usage during the conversion process. If you notice spikes, it may be beneficial to run the conversion on a machine with more resources or to optimize your code to handle memory more efficiently.

  5. Update Aspose.Words: Ensure that you are using the latest version of Aspose.Words, as performance improvements and bug fixes are regularly included in updates.
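Suggestions 1, 3 and 4 above can be sketched roughly as follows. This is only an illustration, not a confirmed fix for the slowdown: `batched`, `convert_in_batches` and `convert_with_aspose` are hypothetical helper names, the batch size is arbitrary, and it is an assumption that running the garbage collector between batches releases memory held by the converter. The converter is passed in as a function so that each page can be timed separately.

```python
import gc
import os
import time
from typing import Callable, Iterator, List, Tuple


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_in_batches(
    pages: List[str],
    convert: Callable[[str], None],
    batch_size: int = 10,
) -> List[Tuple[str, float]]:
    """Call `convert` on every page path and return (path, seconds) per page."""
    timings: List[Tuple[str, float]] = []
    for batch in batched(pages, batch_size):
        for path in batch:
            started = time.perf_counter()
            convert(path)
            timings.append((path, time.perf_counter() - started))
        # Assumption: collecting between batches helps release converter memory.
        gc.collect()
    return timings


def convert_with_aspose(pdf_path: str) -> None:
    """Convert one single-page PDF to Markdown next to the input file."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    stem = os.path.splitext(pdf_path)[0]
    images_dir = f"{stem}_images"
    os.makedirs(images_dir, exist_ok=True)

    options = aw.saving.MarkdownSaveOptions()
    options.images_folder = images_dir  # write images to a folder per page

    document = aw.Document(pdf_path)
    document.save(f"{stem}.md", options)
```

Calling `convert_in_batches(page_paths, convert_with_aspose)` returns one timing per page, which makes it easy to see at which page the conversion starts to slow down.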

If you continue to face issues, consider reaching out to Aspose support with specific details about your implementation and the performance metrics you are observing. They may provide additional insights or optimizations tailored to your use case.

For more information on converting documents to Markdown, refer to the official Aspose.Words documentation.

If you have any further questions or need additional assistance, feel free to ask!

import os
import aspose.words as aw

# source_file is the path to one page of the split PDF
result = os.path.splitext(source_file)[0]  # file path without extension
document = aw.Document(source_file)
document.save(f"{result}.md")

@David_Matin Could you please attach your input document here for testing? We will check the issue and provide you with more information.

https://drive.google.com/file/d/1kacs5XFbctgi9meCYtQRbs1ci6ndNV4W/view?usp=drive_link

@alexey.noskov

@David_Matin
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27317

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When a PDF document is loaded, its fixed-page structure is converted into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity and can be quite resource-consuming.

If I want to convert a PDF document, what should I use? Aspose.PDF?

@alexey.noskov

@David_Matin Yes, you can use Aspose.PDF to process PDF documents. Aspose.PDF is designed specifically to work with PDF documents.
Aspose.Words can also be used to convert PDF documents, but as mentioned above, it is designed primarily to work with MS Word documents.

Can Aspose.PDF convert PDF to Markdown?

@alexey.noskov

@David_Matin According to the Aspose.PDF documentation, it supports only loading MD, not saving to MD format:
https://docs.aspose.com/pdf/python-net/supported-file-formats/

You can contact my colleagues in Aspose.PDF team in the appropriate support forum for more details.

The issues you found earlier (filed as WORDSNET-27317) have been fixed in the Aspose.Words for .NET 24.10 update, which is also available on NuGet.
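For Python users, a quick way to see whether the installed package is at least version 24.10 could look like the sketch below. The thread only confirms the fix for Aspose.Words for .NET 24.10; that the `aspose-words` Python package follows the same version numbering and contains the same fix is an assumption here, and `parse_major_minor`, `has_fix` and `installed_version` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return the first two numeric components of a version such as '24.10.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def has_fix(installed: Optional[str], fixed_in: str = "24.10") -> bool:
    """True if `installed` is at least `fixed_in`; False if nothing is installed."""
    if installed is None:
        return False
    return parse_major_minor(installed) >= parse_major_minor(fixed_in)


def installed_version(package: str = "aspose-words") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

For example, `has_fix(installed_version())` reports whether the local environment is on 24.10 or later.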