Memory allocation during PDF merge

Laksh · December 4, 2020, 9:33pm

I am trying to merge multiple PDFs into a single merged PDF. Since this code will be executed in AWS Lambda, I wanted to check how much memory I need to allocate. Ignore the fact that Lambda has 512 MB of tmp storage and assume everything will be executed in memory.
Below is my code. For brevity, I have removed a few lines.

IList<MemoryStream> files = GetFiles();

try
{
    using (var outStream = new MemoryStream())
    {
        using (var mergePDF = new Document())
        {
            foreach (var file in files)
            {
                var sourcePDF = new Document(file);
                mergePDF.Pages.Add(sourcePDF.Pages);
            }

            mergePDF.Save(outStream);

            // At this point how much memory will it take?
        }
    }
}
finally
{
    foreach (var f in files)
    {
        f.Dispose();
    }
}

Consider, for example, that I have 10 PDF files of 10 MB each. At the point after mergePDF.Save() is invoked and before mergePDF is disposed, how much memory will it take? 400 MB?

100 MB for the individual file streams, since they are not disposed yet.
100 MB for the 10 sourcePDF documents
100 MB for mergePDF
100 MB for outStream

One thing I noticed: I cannot dispose of an individual sourcePDF and its underlying stream until mergePDF.Save(outStream) is invoked. It would be good if I could close the individual files earlier.
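Since the 400 MB figure is an estimate, one way to check it is to measure around the merge. The following is a minimal sketch using only .NET base class library calls; `MemoryProbe` and `Measure` are hypothetical names, and the 10 MB buffer is a stand-in workload, not the actual merge.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class MemoryProbe
{
    // Hypothetical helper (not part of any PDF library): runs the given work
    // and reports managed heap growth plus the process peak working set.
    public static void Measure(string label, Action work)
    {
        // Force a full collection first so the baseline excludes garbage.
        long heapBefore = GC.GetTotalMemory(forceFullCollection: true);

        work();

        // No forced collection here: read what is allocated right after the work.
        long heapAfter = GC.GetTotalMemory(forceFullCollection: false);

        using (var process = Process.GetCurrentProcess())
        {
            Console.WriteLine(
                $"{label}: managed heap delta = {(heapAfter - heapBefore) / (1024 * 1024)} MB, " +
                $"peak working set = {process.PeakWorkingSet64 / (1024 * 1024)} MB");
        }
    }

    static void Main()
    {
        // Stand-in workload: a 10 MB buffer kept alive by a MemoryStream.
        MemoryStream kept = null;
        Measure("10 MB buffer", () => kept = new MemoryStream(new byte[10 * 1024 * 1024]));
        GC.KeepAlive(kept);
    }
}
```

To measure the point asked about, the lambda passed to `Measure` would wrap the foreach loop and the `mergePDF.Save(outStream)` call, inside the using blocks, so the reading is taken before mergePDF is disposed. `GC.GetTotalMemory` only counts managed allocations, which is why the sketch also prints the peak working set.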

asad.ali · December 6, 2020, 8:22pm

@Laksh

How much memory the API allocates depends entirely on the type of PDF documents you are processing. The API loads all PDF content into memory when a file is loaded into a Document object, so a complex PDF document, even a small one, can cause high memory allocation.

Because the API needs to keep all resources in memory for an initialized Document, yes, it is not possible to dispose of a source document before the output is saved. When you concatenate the pages of PDF documents, each source Document, along with its allocated content, remains in use until the output is generated.
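Given that constraint, the source documents can at least be released as soon as the output has been written. The sketch below assumes the same `Document` calls used in the question (`new Document(stream)`, `Pages.Add(...)`, `Save(stream)`, `Dispose()`); `PdfMerger` and `Merge` are hypothetical names, and the memory behavior has not been measured.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.Pdf;

static class PdfMerger
{
    // Hypothetical helper: merges the input streams into `output`, keeping
    // every source Document alive until Save has completed.
    public static void Merge(IList<MemoryStream> files, Stream output)
    {
        // Track the sources so they can be disposed after the save.
        var sources = new List<Document>();
        try
        {
            using (var mergePDF = new Document())
            {
                foreach (var file in files)
                {
                    var sourcePDF = new Document(file);
                    sources.Add(sourcePDF);
                    mergePDF.Pages.Add(sourcePDF.Pages);
                }

                // Sources must still be open here, per the explanation above.
                mergePDF.Save(output);
            }
        }
        finally
        {
            // Release the source documents first, then their input streams.
            foreach (var source in sources)
            {
                source.Dispose();
            }
            foreach (var file in files)
            {
                file.Dispose();
            }
        }
    }
}
```

Writing into a caller-supplied stream avoids an extra copy of the merged bytes, such as the one `MemoryStream.ToArray()` would create.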

Nevertheless, we have logged an investigation ticket, PDFNET-49120, in our issue tracking system to further analyze the scenario you presented. We will look into the details and let you know as soon as the ticket is resolved. Meanwhile, could you please share some sample PDF documents for our reference, so that we can also check what type of PDF documents you are processing?

Laksh · December 21, 2020, 9:53pm

If everything is loaded into memory, why would we need the source PDF to stay open?

asad.ali · December 22, 2020, 6:56pm

@Laksh

The API keeps all resources and content in memory, linked to their respective Document instance. Once a document is closed/disposed/saved, all opened/allocated/linked resources that were linked to that particular instance are also released from memory.