Conversion of huge PDF docs to Word - what is the best way?

dr.doc · April 1, 2017, 4:27pm

Hi Aspose team,

I have an 800-page PDF document that I would like to convert to Word. The code is very simple, but the conversion takes 17 minutes on my computer.

Document pdfDocument = new Document(uploadFolder + @"\" + fileNameOld);
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.ImageResolutionX = 150;
saveOptions.ImageResolutionY = 150;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
pdfDocument.Save(uploadFolder + @"\" + fileNameNew, saveOptions);

I guess the problem is that Aspose.PDF reads the whole document into memory and then runs the conversion entirely in memory.

Do you have any recommendation for doing this in a better way? Is there an option to convert huge documents in batches, for example read the first 50 pages and save them to a Word document, then read the next 50 pages and append them to the existing document, and so on?

What are your recommendations for improving conversion performance?

By the way, the PDF document is around 24 MB, but the conversion occupies more than 1 GB of memory, which is really a lot. I expected it to take up to 10x the PDF size, not 40x.

dr.doc · April 1, 2017, 4:30pm

Hi Aspose team,

In my example I have an 800-page PDF document that I would like to convert to Word. The code is simple, but the conversion takes 17 minutes on my computer (i7, 16 GB).

Document pdfDocument = new Document(uploadFolder + @"\" + fileNameOld);
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.ImageResolutionX = 150;
saveOptions.ImageResolutionY = 150;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
pdfDocument.Save(uploadFolder + @"\" + fileNameNew, saveOptions);

I guess the problem is that Aspose.PDF reads the whole document into memory and then runs the conversion entirely in memory.

What would be the best way to do a big conversion? Is there an option to convert huge documents in batches, for example read the first 50 pages and save them to a Word document, then read the next 50 pages and append them to the existing document, and so on?

What are your recommendations for improving conversion performance?

By the way, the PDF document is around 24 MB, but the conversion occupies more than 1 GB of memory, which is really a lot. I expected it to take up to 10x the PDF size, not 40x.

dr.doc · April 2, 2017, 4:28am

Hi Aspose team,

I tried to post this question yesterday, but when I click on the posts I get "Page not found", so here is my third try.

In my example I have an 800-page PDF document that I would like to convert to Word. The code is simple, but the conversion takes 17 minutes on my computer (i7, 16 GB).

Document pdfDocument = new Document(uploadFolder + @"\" + fileNameOld);
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.ImageResolutionX = 150;
saveOptions.ImageResolutionY = 150;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
pdfDocument.Save(uploadFolder + @"\" + fileNameNew, saveOptions);

I guess the problem is that Aspose.PDF reads the whole document into memory and then runs the conversion entirely in memory.

What would be the best way to do a big conversion? Is there an option to convert huge documents in batches, for example read the first 50 pages and save them to a Word document, then read the next 50 pages and append them to the existing document, and so on?

What are your recommendations for improving conversion performance?

By the way, the PDF document is around 24 MB, but the conversion occupies more than 1 GB of memory, which is really a lot. I expected it to take up to 10x the PDF size, not 40x.
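
The batching idea described in the question could be sketched roughly as follows. This is only an illustration under assumptions, not a confirmed Aspose recommendation: the class name `BatchConverter`, the helper `BatchRanges`, and the `part_001.docx` naming scheme are hypothetical, and whether this actually lowers peak memory or run time for a given file is untested. It copies each range of pages into a temporary `Document` and saves each range as a separate DOCX file. Appending the parts into one Word file would need a separate tool and is not shown.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf; // assumption: the Aspose.Pdf for .NET package is referenced

static class BatchConverter
{
    // Yields 1-based, inclusive page ranges of at most batchSize pages each.
    public static IEnumerable<(int First, int Last)> BatchRanges(int pageCount, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        for (int first = 1; first <= pageCount; first += batchSize)
            yield return (first, Math.Min(first + batchSize - 1, pageCount));
    }

    // Saves every batch of pages as its own DOCX file (part_001.docx, part_002.docx, ...).
    public static void ConvertInBatches(string pdfPath, string outputFolder, int batchSize)
    {
        using (Document source = new Document(pdfPath))
        {
            int part = 1;
            foreach (var (first, last) in BatchRanges(source.Pages.Count, batchSize))
            {
                using (Document chunk = new Document())
                {
                    // Aspose.Pdf page indexes start at 1.
                    for (int p = first; p <= last; p++)
                        chunk.Pages.Add(source.Pages[p]);

                    DocSaveOptions saveOptions = new DocSaveOptions
                    {
                        Format = DocSaveOptions.DocFormat.DocX,
                        Mode = DocSaveOptions.RecognitionMode.Flow
                    };
                    string outPath = System.IO.Path.Combine(outputFolder, $"part_{part:D3}.docx");
                    chunk.Save(outPath, saveOptions);
                }
                part++;
            }
        }
    }
}
```

A call such as `BatchConverter.ConvertInBatches(pdfPath, outputFolder, 50)` would then write one DOCX file per 50 pages. The source PDF is still opened as a whole, so any saving would come only from the conversion step working on smaller documents.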

asad.ali · April 3, 2017, 10:55am

Hi Oliver,

Thanks for your inquiry.

dr_oli:

I tried to post this question yesterday, but when I click on the posts I get “Page not found”, so here is my third try.

This happened because of the word “docs” in the title of the post. Anyway, I have merged your three posts so that you can follow up more easily within a single thread.

I tried to convert one of my sample PDFs to DOC and did not notice any memory consumption issue. Please note that the performance of the API depends on many factors, e.g. the complexity of the input file, the version of the API, and the environment in which you are running it. I tested the scenario using Aspose.Pdf for .NET 17.3.0 on Windows 10 x64. We would really appreciate it if you could share your environment details along with a sample input document, so that we can try to replicate the scenario in our environment and get back to you accordingly.

Moreover, our forum supports uploads of up to 20 MB. If your sample document is larger than that, you can upload it to a public file hosting service and share the link here.

We are sorry for the inconvenience.

Best Regards,

dr.doc · April 3, 2017, 11:11am

Hi Asad,

It was more of a generic question: what does the Aspose team recommend for processing big PDF documents?

If you make a small console application with this code:

Document pdfDocument = new Document("path to test file");
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.ImageResolutionX = 150;
saveOptions.ImageResolutionY = 150;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
pdfDocument.Save("Path to export file", saveOptions);

and run it over the document that I attached, what is your memory consumption?
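
To make such memory and timing figures comparable between machines, the conversion call can be wrapped in a small measurement helper. The sketch below is an assumption about how one might measure it, not something used in this thread: `ConversionProbe` and `Measure` are hypothetical names, and it relies only on the standard `Stopwatch` and `Process.PeakWorkingSet64` APIs from `System.Diagnostics`.

```csharp
using System;
using System.Diagnostics;

static class ConversionProbe
{
    // Runs the given work and returns the elapsed time together with the
    // peak working set of the current process, in bytes.
    public static (TimeSpan Elapsed, long PeakWorkingSetBytes) Measure(Action work)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();

        using (Process process = Process.GetCurrentProcess())
        {
            process.Refresh(); // make sure the memory counters are current
            return (stopwatch.Elapsed, process.PeakWorkingSet64);
        }
    }
}
```

For example, `ConversionProbe.Measure(() => pdfDocument.Save(outputPath, saveOptions))` would report how long the save took and how much memory the process used at its peak. The peak covers the whole process lifetime, so loading the PDF beforehand is included in the figure.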

asad.ali · April 4, 2017, 9:47am

Hi Oliver,

Thanks for sharing the input document.

dr_oli:

It was more of a generic question: what does the Aspose team recommend for processing big PDF documents?

Please note that Aspose.Pdf has no limitation in terms of file size, so there is no specifically recommended way to deal with large documents. As shared earlier, the performance of the API depends on several factors, e.g. the complexity of the document(s) and the environment details. In general, it is recommended to build your project for the x64 platform when you are dealing with large documents, because a 64-bit process can use far more memory, whereas a 32-bit (x86) process is by default limited to about 2 GB of address space regardless of how much memory is installed on your machine.

Furthermore, I have tested the scenario in my environment (i.e. Windows 10, 8 GB RAM, Aspose.Pdf 17.4.0) and noticed that the code execution never finished when I used the input document you shared, whereas the same scenario with one of my sample files went well. Regarding memory consumption, the program kept consuming 280 MB of memory.

Therefore, I have logged a performance issue as PDFNET-42526 in our issue tracking system. We will look further into the details of this issue and keep you informed on the status of its correction. Please be patient and spare us a little time.
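
Whether an application really runs as a 64-bit process can be checked at run time. The following is a minimal sketch using standard .NET properties; the class name `BitnessCheck` is hypothetical. In a Visual Studio project the bitness is typically controlled by the platform target setting (for example `<PlatformTarget>x64</PlatformTarget>` in the project file).

```csharp
using System;

static class BitnessCheck
{
    // Describes whether the current process and operating system are 64-bit.
    public static string Describe()
    {
        return $"64-bit process: {Environment.Is64BitProcess}, " +
               $"64-bit OS: {Environment.Is64BitOperatingSystem}, " +
               $"pointer size: {IntPtr.Size} bytes";
    }
}
```

Printing `BitnessCheck.Describe()` at startup shows a pointer size of 8 bytes in a 64-bit process and 4 bytes in a 32-bit one.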

We are sorry for the inconvenience.

Best Regards,

dr.doc · October 2, 2017, 7:36am

Hi team,

Any news here? This has been pending since April, and in the meantime you have resolved some other issues that I reported after this one.

Thx,
Oliver

asad.ali · October 2, 2017, 1:59pm

@dr_oli

Thanks for posting your inquiry.

Please note that the resolution of a logged issue depends on how complex the issue is and which component of the API is causing it. Furthermore, the product team has its own development schedule, which depends on issue priority, category, and API components. Performance-related issues usually need more detailed investigation and take time to resolve.

However, we have informed the relevant team about your concerns, and as soon as we make significant progress towards resolving the issue, we will let you know. Please be patient and spare us a little time.

We are sorry for the inconvenience.