Subject: loading PDF documents for conversion in Aspose.Pdf very slow compared to MS Office documents

Hi,

With Aspose.Pdf we’ve noticed that loading PDF documents for conversion (var document = new Aspose.Pdf.Document(stream)) takes considerably longer than loading MS Office documents in Aspose.Words. For example, loading a large MS Word document (3000+ pages, text only, no images) is up to 6x faster than loading the same document converted to PDF. We’ve tested this with Aspose.Words and Aspose.Pdf versions 20.x and 21.x; the machine is not limited in terms of available memory or CPU power, and the input files are loaded from a local NVMe SSD.

Is there any way we can speed up loading PDF documents or is PDF inherently slower to process compared to MS Office documents?

Kind regards!

@T_Limburg

Could you please share a sample document so that we can test the scenario in our environment and share our feedback with you?

Hi,

Here is a WeTransfer link to a sample document: https://we.tl/t-xCciljEBAz

With kind regards,

Thomas Limburg

@T_Limburg

We tested the scenario in our environment using Aspose.PDF for .NET 21.4 and noticed that the API took ~1 second to load the sample PDF. Could you please share how long it takes on your side and what loading time you expect?

Hi,

Since your result of ~1 second is very different from ours (several minutes), we went back to our test setup and discovered an error in our measurement. Our timed result was for loading AND conversion to PDF/A, not just for loading. Our bad!

Nevertheless, the total time it takes to convert PDF to PDF/A is still something we would like to reduce if possible. We’ve tested several conversion settings, but unfortunately none so far has resulted in a significant improvement. Do you have any suggestions, or is PDF to PDF/A inherently a time-consuming operation (for large text documents)?

Kind regards,
Thomas
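
To keep loading and conversion from being folded into one number again, each phase can be timed on its own. The helper below is only a minimal sketch: `PhaseTimer` and `Measure` are hypothetical names introduced for illustration and are not part of Aspose.PDF, and the Aspose calls appear only in comments so that the snippet compiles without the library.

```csharp
using System;
using System.Diagnostics;

public static class PhaseTimer
{
    // Runs the given action once and returns the elapsed wall-clock time
    // in milliseconds. Hypothetical helper, not an Aspose.PDF API.
    public static long Measure(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}

// Assumed usage, one measurement per phase (requires Aspose.PDF):
//   Aspose.Pdf.Document document = null;
//   long loadMs    = PhaseTimer.Measure(() => document = new Aspose.Pdf.Document(sourceFilePath));
//   long convertMs = PhaseTimer.Measure(() => document.Convert(options));
//   long saveMs    = PhaseTimer.Measure(() => document.Save(destinationFilePath));
```

Reporting the three values side by side makes it clear which step dominates the total.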

@T_Limburg

Can you please share which PDF/A format you are converting your PDF file to, e.g. PDF/A_1a? Also, please share the sample code snippet you are using for the conversion. We will then assist you accordingly.

Hi,
We are using this function to convert to PDF_A_1B:

    // Requires: using System.Diagnostics;
    private long ConvertToPDF(string sourceFilePath, string destinationFilePath)
    {
        long conversionTime = -1;

        using (var document = new Aspose.Pdf.Document(sourceFilePath))
        {
            try
            {
                var options = new Aspose.Pdf.PdfFormatConversionOptions(Aspose.Pdf.PdfFormat.PDF_A_1B)
                {
                    ErrorAction = Aspose.Pdf.ConvertErrorAction.None,
                };

                // optimization options are normally read from configuration file; the values
                // below are our default values but we've tried various combinations
                var optimizationOptions = new Aspose.Pdf.Optimization.OptimizationOptions
                {
                    LinkDuplcateStreams = true, // (sic) spelled this way in the Aspose.PDF API
                    RemoveUnusedObjects = true,
                    RemoveUnusedStreams = true,
                    ImageCompressionOptions =
                    {
                        CompressImages = false,
                        ImageQuality = 50
                    }
                };

                document.OptimizeResources(optimizationOptions);

                // only the Convert() call is timed; loading and optimization are excluded
                Stopwatch sw = Stopwatch.StartNew();

                document.Convert(options);

                sw.Stop();
                conversionTime = sw.ElapsedMilliseconds;

                // temporarily disabled: document.Save(destinationFilePath);
            }
            finally
            {
                document.FreeMemory();
            }
        }

        return conversionTime;
    }

Kind regards,
Thomas

@T_Limburg

Thanks for sharing the sample code snippet.

We deleted the file from our system after performing the initial tests, and the WeTransfer link you shared has since expired. Could you please share the link or file again so that we can test the scenario once more and share our feedback with you?

Hi,

We have made two documents available, which you can access via this link: https://we.tl/t-fBQ0UZ6etT

  1. “aspose_test_35MB.pdf” (7298 pages text only); this is the one we sent you before.

  2. “Large Lorem ipsum 17MB” (3649 pages text only); smaller document to speed up testing.

Our average timed results for executing the .Convert() method:

35.5 minutes for the 35 MB file

7.8 minutes for the 17 MB file

System specs:

Intel Core i7-6700 with 4 cores/8 threads running at 4.0 GHz

32 GB RAM

Kind regards,

Thomas
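
The two timings reported above (35.5 and 7.8 minutes) hint at how conversion time grows with document size: the page count doubles (3649 to 7298) while the time grows by a factor of roughly 4.5. A rough way to quantify this is to fit time ~ pages^k through the two points. The sketch below does that; `ScalingEstimate` and `Exponent` are hypothetical names, and with only two measurements the result is an indication rather than a reliable model.

```csharp
using System;

public static class ScalingEstimate
{
    // Estimates k in "time ~ pages^k" from two (pages, minutes) measurements.
    // Hypothetical helper for illustration only.
    public static double Exponent(double pagesA, double minutesA,
                                  double pagesB, double minutesB)
    {
        return Math.Log(minutesB / minutesA) / Math.Log(pagesB / pagesA);
    }
}

// Usage with the figures from this thread:
//   double k = ScalingEstimate.Exponent(3649, 7.8, 7298, 35.5);
```

For these figures k comes out at about 2.2, i.e. the conversion time appears to grow roughly quadratically with the number of pages rather than linearly.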

@T_Limburg

Thanks for sharing the documents.

We have tested the scenario using both files and observed conversion times similar to those you reported. We have logged an issue as PDFNET-49974 in our issue tracking system for investigation. We will look into the details of this scenario and let you know as soon as the ticket is resolved. Please be patient and allow us some time.

We are sorry for the inconvenience.