Merging Files - Memory Usage

I am trying to merge 40 1GB files into a single 40GB file by doing roughly the following:

List<string> files; // has all 40 filenames
Document mergeDoc = new Document();
// other setup
try
{
  foreach(string file in files)
  {
    Document srcFile = null;
    try
    {
      srcFile = new Document(file);
      mergeDoc.Pages.Add(srcFile.Pages);
      // a lot of other processing, including specific bookmark processing
    }
    catch
    {
      // error handling
    }
    finally
    {
      srcFile?.Dispose();
    }
  }
}
finally
{
  mergeDoc.Save();
  mergeDoc.Dispose();
}

I have a few questions:

Q1) What would be the peak amount of RAM usage with this approach? I estimate it would be around 41GB (40GB for mergeDoc + 1GB for the final srcFile).

Q2) Working off of Q1, this program must be able to handle 5 threads of the above code at the same time. If each thread has 40GB total of files to merge would the system require 200GB+ of RAM?

Q3) Would opening, saving, and closing mergeDoc inside of the foreach loop provide any memory savings?

Q4) Is there a way to append pages, and cherry-pick bookmarks, from a srcFile to mergeDoc without having the entirety of mergeDoc loaded into memory?

Q5) Is there another way to process these merges in a more memory efficient manner that I haven’t mentioned?

Note: I do not think I can utilize PdfFileEditor.Concatenate() as the program needs to process bookmarks and handle merge errors file-by-file, not post-merge.

@bkramerQTCM

To address your questions regarding merging large PDF files and memory usage, here are some insights:

Q1) Peak RAM Usage

Your estimation of peak RAM usage is reasonable. When merging 40 files of 1GB each into a single 40GB file, the memory usage could peak around 41GB. This includes the memory required for the mergeDoc and the currently processed srcFile. However, actual memory usage may vary based on the implementation and the system’s memory management.

Q2) Handling Multiple Threads

If you run five threads simultaneously, each processing 40GB of files, the total memory requirement could indeed exceed 200GB. Each thread would potentially require its own 41GB, leading to significant memory consumption. It’s crucial to monitor the system’s memory capacity to avoid performance degradation or crashes.
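If five concurrent merges would exceed the available RAM, one option is to cap how many run at once rather than starting all five together. Below is a minimal, hypothetical sketch using standard .NET primitives; MergeBatch is a placeholder for the per-batch merge loop shown in the question, not part of any library API.

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class MergeThrottle
{
    // Allow at most two merges in flight at once; tune to the RAM actually available.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(2);

    public static async Task RunAllAsync(IEnumerable<List<string>> batches)
    {
        var tasks = new List<Task>();
        foreach (var batch in batches)
        {
            await Gate.WaitAsync();
            tasks.Add(Task.Run(() =>
            {
                try { MergeBatch(batch); }      // the per-batch merge loop from the question
                finally { Gate.Release(); }
            }));
        }
        await Task.WhenAll(tasks);
    }

    // Placeholder for the existing merge logic.
    private static void MergeBatch(List<string> files) { }
}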

Q3) Memory Savings by Opening and Closing mergeDoc

Opening, saving, and closing mergeDoc within the foreach loop could provide some memory savings, as it would release the memory used by mergeDoc after each iteration. However, this approach may introduce overhead due to repeated opening and closing of the document, which could impact performance.
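If you want to try that approach, one way is to keep the growing output on disk and reopen it before each append, so that only one source file plus the reopened output are held as Document objects at a time. This is only a rough sketch with hypothetical file paths; it saves to a separate temp file because saving over the file a Document was opened from can fail, and whether the working set actually shrinks depends on how lazily Aspose.PDF loads pages from the reopened file.

using System.IO;
using System.Linq;
using Aspose.Pdf;

// Hypothetical paths; adjust to your environment.
string outputPath = @"merged.pdf";
string tempPath = @"merged.tmp.pdf";

// Seed the output with the first source file, then fold in the rest one at a time.
File.Copy(files[0], outputPath, true);

foreach (string file in files.Skip(1))
{
    using (Document partial = new Document(outputPath))
    using (Document srcFile = new Document(file))
    {
        partial.Pages.Add(srcFile.Pages);
        // per-file bookmark processing would go here
        partial.Save(tempPath); // write to a separate file, then swap it in below
    }
    File.Delete(outputPath);
    File.Move(tempPath, outputPath);
}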

Q4) Appending Pages Without Full Memory Load

Currently, there isn’t a direct method to append pages and selectively process bookmarks from a source file without loading the entire mergeDoc into memory. The typical approach involves managing the document in memory, which can be resource-intensive.

Q5) More Memory-Efficient Merging

For a more memory-efficient merging process, consider processing the files in smaller batches or using streams. This way, you can read and write data incrementally, reducing the memory footprint. Additionally, using the PdfFileEditor class for concatenation might be beneficial, as it is designed for efficient file handling, although it may not support the specific bookmark processing you require.
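As an illustration of batch processing, the sketch below concatenates the inputs in groups of ten and then merges the intermediate results. It assumes the string-array overload of PdfFileEditor.Concatenate and uses hypothetical intermediate file names; it does not perform any bookmark handling, which would have to be a separate pass.

using System.Collections.Generic;
using System.Linq;
using Aspose.Pdf.Facades;

// Merge in batches of 10 files, then concatenate the intermediate outputs.
var intermediates = new List<string>();
for (int i = 0; i < files.Count; i += 10)
{
    string batchOutput = $"batch_{i / 10}.pdf"; // hypothetical intermediate file names
    var editor = new PdfFileEditor();
    editor.Concatenate(files.Skip(i).Take(10).ToArray(), batchOutput);
    intermediates.Add(batchOutput);
}
new PdfFileEditor().Concatenate(intermediates.ToArray(), "merged.pdf");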

If you need to handle bookmarks and errors file-by-file, you might need to implement a custom solution that processes each file sequentially while managing memory usage effectively.

If you have further questions or need more specific examples, feel free to ask!

Say I am merging the 40 1GB files one at a time using PdfFileEditor.Concatenate(). How much memory would be consumed on the 40th call to Concatenate()?

@bkramerQTCM

The API's memory consumption is not constant and cannot be predicted from file size alone; it depends entirely on the structure of the PDF document. Even a small PDF can occupy more memory than expected because of a complex structure and the resources embedded in it, while larger files are often processed without consuming significant memory.

We therefore request that you use the latest version of the API on an x64 architecture. If you face any issues, or notice that memory consumption is higher than expected, please let us know and we will investigate the case and address it accordingly.
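If it helps while investigating, memory use around a single Concatenate call can be measured directly with standard .NET diagnostics. The sketch below uses hypothetical file names standing in for the 39-file result and the 40th input, and assumes the string-array overload of Concatenate.

using System;
using System.Diagnostics;
using Aspose.Pdf.Facades;

// Snapshot the managed heap and working set around one concatenation.
Process process = Process.GetCurrentProcess();
long heapBefore = GC.GetTotalMemory(true);

var editor = new PdfFileEditor();
// hypothetical file names: "merged result so far" + "file 40"
editor.Concatenate(new[] { "merged_39.pdf", "file_40.pdf" }, "merged_40.pdf");

long heapAfter = GC.GetTotalMemory(true);
process.Refresh();
Console.WriteLine($"Managed heap delta: {(heapAfter - heapBefore) / (1024 * 1024)} MB");
Console.WriteLine($"Peak working set:   {process.PeakWorkingSet64 / (1024 * 1024)} MB");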