I have a 30 MB PDF file with 50 pages, and loading it makes Aspose.Pdf.Document consume a whopping 21.4 GB of memory. memoryUsage.png (91.1 KB)
I’ve prepared a very simple project, which loads the document, iterates over the pages reading Contents.Count (i.e. no editing), then waits for a key press (so that I can capture the memory usage at the end). Please note that you need to add a license file yourself, because the evaluation version limits the number of pages it can access.
(I’ll provide a link to it soon in my next post)
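For clarity, the core of the project looks roughly like this (a minimal sketch reconstructed from the description above; the file names are placeholders):

```csharp
using System;
using Aspose.Pdf;

class Program
{
    static void Main()
    {
        // A license file must be supplied; the evaluation version
        // limits the number of pages that can be accessed.
        new License().SetLicense("Aspose.Pdf.lic");

        using (var document = new Document("input.pdf"))
        {
            foreach (Page page in document.Pages)
            {
                // Read-only access: just count the content operators.
                Console.WriteLine(page.Contents.Count);
            }

            // Keep the process alive so the memory usage can be captured.
            Console.ReadKey();
        }
    }
}
```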
I’ve tried a few approaches to reduce the memory usage:
Calling the GC
Issuing a Document.FreeMemory()
“Incremental saving” with Document.Save()
These calls were made after every page. To my surprise, none of them had any positive effect, and Document.Save() seemingly closed the underlying stream, so it even broke the program.
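The per-page mitigation attempts looked roughly like this (again a sketch, not the exact project code):

```csharp
foreach (Page page in document.Pages)
{
    Console.WriteLine(page.Contents.Count);

    // Attempt 1: force a full garbage collection.
    GC.Collect();
    GC.WaitForPendingFinalizers();

    // Attempt 2: ask Aspose.PDF to release cached data.
    document.FreeMemory();

    // Attempt 3: "incremental saving"; this closed the underlying
    // stream and broke the following iterations, so it is commented out.
    // document.Save();
}
```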
My questions would be the following:
1.) What is the reason for this disproportionately large memory usage? The PDF file doesn’t seem to be unusually large, which makes the whole thing annoying.
2.) What can I do to make the processing of such files feasible on machines with fewer resources (say, “just” 16 GB of RAM)?
3.) How should the “incremental save” feature be used?
It is quite difficult to answer such questions, because CPU performance and memory usage both depend on the complexity and size of the documents you are loading/generating.
When a PDF document is closed, all of its DOM data is purged from memory during the next garbage-collector cycle. Please note that the memory may not be released until you close the application.
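Assuming the .NET API, Document implements IDisposable, so the deterministic way to close it is (a minimal sketch):

```csharp
using (var document = new Document("input.pdf"))
{
    // ... work with the document ...
}
// After disposal, the DOM data becomes eligible for collection
// during the next garbage-collector cycle.
```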
I’m using the latest version (22.2), so these symptoms apply to that version (and to older ones too).
Have you checked the project, especially the PDF file I sent? Maybe there is something about it that explains the memory usage and could lead to a solution.
Do you have any suggestions for questions 2.) and 3.)?
Unfortunately, the input PDF file is corrupted. We cannot open it in Adobe Reader, and Aspose.PDF does not import it either. Please check the attached image. image.png (28.1 KB)
16 GB of memory is enough to process such documents. However, one should make sure to run/debug in x64 mode while processing large files. Also, a simple or small document may contain a complex structure with a lot of elements, which leads to large memory consumption, because the API loads all required resources into memory while processing it.
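A quick way to verify the x64 point at runtime (a hypothetical guard, just for illustration):

```csharp
// A 32-bit process caps the address space far below what large PDFs need.
if (!Environment.Is64BitProcess)
    throw new InvalidOperationException("Run this tool as a 64-bit process.");
```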
The incremental save approach is recommended during the PDF generation process. You can use it to gradually build a document by adding content and other objects into it. In the case of an existing PDF document, it overwrites the file and closes all of the opened resources.
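Based on that description, the intended pattern during generation is presumably something like this (a sketch, assuming the parameterless Save() writes the pending changes back to the file the document was opened from):

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Generate the initial document and save it once.
var document = new Document();
document.Pages.Add().Paragraphs.Add(new TextFragment("Page 1"));
document.Save("output.pdf");

// Reopen and extend it; the parameterless Save() writes the changes
// back into "output.pdf" incrementally.
document = new Document("output.pdf");
document.Pages.Add().Paragraphs.Add(new TextFragment("Page 2"));
document.Save();
```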
That’s strange. Have you checked the checksum of the merged zip file? My coworker and I could open it without problems after downloading, merging, and unzipping the project.
Since this forum limits uploaded files to 5 MB (the error message says 50000kb, though), I had to split the zip up.
Anyway, now you can also access the merged version here, but only for the next 7 days:
The checksum of the zip should be the same. The PDF within has the following SHA-256 sum:
4e3570fcebb8bfa93fbda084b3e9bcbed32eb85f3bdd9fd98d2ce78aa78c6a88
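In case you want to double-check on your side, the sum can be computed with a few lines of C# (a hypothetical snippet; any sha256sum-style tool works just as well):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Compute and print the SHA-256 of the extracted PDF.
using (var stream = File.OpenRead("input.pdf"))
using (var sha = SHA256.Create())
{
    byte[] hash = sha.ComputeHash(stream);
    Console.WriteLine(BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant());
}
```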