PDF to DOCX performance in .NET

I need to convert a large number of PDF files to .docx as a routine process. The code below converts the files, but it seems slow (2 hours and counting for 500+ documents). Are there options I can tweak to speed up the conversion?

//Convert (.pdf -> .docx)
using Aspose.Pdf.Document pdfDoc = new Aspose.Pdf.Document(filename);
Aspose.Pdf.DocSaveOptions saveOptions = new Aspose.Pdf.DocSaveOptions();
saveOptions.Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX;

PDFToWordFeedback pdfToWordFeedback = new PDFToWordFeedback(filename);
saveOptions.CustomProgressHandler = pdfToWordFeedback.PDFtoWordProgress;
saveOptions.WarningHandler = pdfToWordFeedback;

pdfDoc.Save(newFilename, saveOptions);

Any suggestions would be greatly appreciated.

@aspears

Would you please share your environment details, i.e. application type, OS name and version, etc.? We will try to create an example in our environment and share our feedback with you.

The application will be a console application written in C# on .NET 8. It will run on a Windows 10 machine (Windows 11 eventually).

The gist of the application is that it loops through a folder of zip files and finds any PDFs in them. It then converts each PDF into a Word document and rezips the files. Later, the same process is used to convert the Word documents back to PDF. The program is already in use, but we are currently converting PDFs to PostScript, which is less than ideal, so we are looking for better options.

Thanks!

@aspears

Thanks for sharing the details. Please give us some time to prepare an example. Creating a sample application can take a little while, and if we are unable to achieve better performance in the process, we will create a ticket in our issue tracking system to address the performance issue. We will share more details with you soon.

@aspears

There are several optimizations and tweaks you can apply to speed up the process. Here are some strategies you can use with Aspose.PDF to improve the performance:

Use Less Precise Formatting

If the exact formatting of the PDF is less important, a less precise recognition mode can improve performance. Set the Mode property of DocSaveOptions to a RecognitionMode value to balance speed against quality.

saveOptions.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Textbox;  // Textbox is faster than Flow mode

Disable Optional Recognition Features

You can also speed up the process by turning off recognition steps that you do not need, such as bullet recognition and the insertion of returns at line ends.

saveOptions.RecognizeBullets = false;  // Disabling bullet recognition might speed up the process.
saveOptions.AddReturnToLineEnd = false;  // Avoid adding returns at the end of each line for faster conversion.

Process Files in Parallel

Processing the files one by one is likely a bottleneck, especially with 500+ files. To improve throughput, you can parallelize the conversion using Parallel.ForEach or another parallel processing mechanism such as Task.Run. If you implement multi-threading, please make sure that each file is accessed by only one thread.
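As a rough illustration of the parallel approach, here is a minimal sketch for a .NET 8 console application using top-level statements. The folder paths and the degree of parallelism are assumptions chosen for the example, not recommended values, and the progress and warning handlers from the original code are left out for brevity. Each iteration creates its own Document and DocSaveOptions, so no file or object is shared between threads.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical folders for illustration; replace with your own paths.
string inputDir = @"C:\work\pdf-in";
string outputDir = @"C:\work\docx-out";
Directory.CreateDirectory(outputDir);

// Cap the number of concurrent conversions. Half the logical cores is an
// assumed starting point; tune it against your own CPU and memory limits.
var parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount / 2)
};

Parallel.ForEach(Directory.EnumerateFiles(inputDir, "*.pdf"), parallelOptions, pdfPath =>
{
    string docxPath = Path.Combine(
        outputDir, Path.GetFileNameWithoutExtension(pdfPath) + ".docx");

    try
    {
        // One Document and one DocSaveOptions per file, never shared across threads.
        using var pdfDoc = new Aspose.Pdf.Document(pdfPath);
        var saveOptions = new Aspose.Pdf.DocSaveOptions
        {
            Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX,
            Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Textbox
        };
        pdfDoc.Save(docxPath, saveOptions);
    }
    catch (Exception ex)
    {
        // Log the failure and continue so one bad PDF does not stop the batch.
        Console.Error.WriteLine($"Failed to convert {pdfPath}: {ex.Message}");
    }
});
```

Conversion is typically memory-intensive, so it may be worth starting with a low degree of parallelism and increasing it while watching memory usage, rather than letting Parallel.ForEach use every core by default.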

Finally, we recommend that you try version 24.8, which is the latest release and includes the most improvements in memory consumption and speed. Please feel free to let us know how it goes after trying the above suggestions.