Looking for some information about the IncrementalUpdates property (setIncrementalUpdates) on the PdfFileEditor class. We are seeing performance issues when this is set to true, and the generated output appears to be the same either way.
The documentation on this property is limited. Could someone explain what it does, whether we need to set it to true, and what effect setting it to false would have? We are setting this before eventually calling the concatenate method.
Thanks.
@douglasdallas
Could you please provide more details about the performance issues you are experiencing with the incrementalUpdates method? Specifically, what kind of performance issues are you seeing?
For a large PDF file, the merge takes hours rather than minutes when IncrementalUpdates is set to false.
@douglasdallas
Setting IncrementalUpdates to true means that the document will be saved with incremental updates, i.e. new data will be appended after the end of the original file (and the list of new objects will be recorded in a new cross-reference section).
[PDF specification]
7.5.6 Incremental Updates
The contents of a PDF file can be updated incrementally without rewriting the entire file. When updating a PDF file incrementally, changes shall be appended to the end of the file, leaving its original contents intact.
NOTE: The main advantage of updating a file in this way is that small changes to a large document can be saved quickly. There are additional advantages:
In certain contexts, such as when editing a document across an HTTP connection or using OLE embedding (a Windows-specific technology), a conforming writer cannot overwrite the contents of the original file. Incremental updates may be used to save changes to documents in these contexts.
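As a rough illustration of the append-only structure the spec describes (this is my own sketch, not from the documentation): each incremental update appends its own trailer, ending in a %%EOF marker, after the original file contents. So a crude way to check whether a PDF was saved incrementally is to count the %%EOF markers in its bytes, assuming the file is well-formed:

```java
import java.nio.charset.StandardCharsets;

public class EofCounter {
    // Count occurrences of the "%%EOF" marker in a PDF's bytes.
    // A file written with incremental updates carries one trailer
    // (ending in %%EOF) per update appended after the original body,
    // so more than one marker suggests incremental saves.
    public static int countEofMarkers(byte[] pdfBytes) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        int count = 0;
        for (int i = 0; i + marker.length <= pdfBytes.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (pdfBytes[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Synthetic example: an original body plus one incremental
        // update, each ending with its own trailer.
        String fake = "%PDF-1.7\n...original objects...\n%%EOF\n"
                + "...appended objects...\n%%EOF\n";
        System.out.println(countEofMarkers(
                fake.getBytes(StandardCharsets.US_ASCII)));  // prints 2
    }
}
```

This is only a heuristic (the marker could in principle appear inside stream data), but it makes the "changes are appended, the original stays intact" behaviour easy to observe on real files.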
However, could you please share your sample files and a complete code snippet with us so that we can investigate the reasons behind this issue and address it accordingly?
Thanks for the reply.
For our use case we are trying to find the most performant (I/O and memory) way of merging together a large number of input streams; in more extreme cases this could be hundreds.
To keep the memory footprint down, we have code that does this via temporary files rather than holding all the input streams in memory.
Along these lines:
// Use PdfFileEditor to retain the tagged logical structure
final PdfFileEditor editor = new PdfFileEditor();
editor.setCopyLogicalStructure(true);
editor.setUseDiskBuffer(true);
editor.setIncrementalUpdates(true);

// For each input stream, save the document to its own temporary file
final File[] inputStreamFiles = new File[inputStreams.length];
for (int i = 0; i < inputStreams.length; i++) {
    inputStreamFiles[i] = File.createTempFile("merged-is", ".pdf");
    try (Document currentDocument = new Document(inputStreams[i])) {
        currentDocument.save(inputStreamFiles[i].getAbsolutePath());
    }
}

// Create a temporary file to hold the merged result
final File tempFile = File.createTempFile("merged", ".pdf");
final String[] tempFilenames = Arrays.stream(inputStreamFiles)
        .map(File::getAbsolutePath)
        .toArray(String[]::new);

// Concatenate the intermediate files into the temp file
editor.concatenate(tempFilenames, tempFile.getAbsolutePath());
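One detail the snippet above leaves out (my own addition, not part of the original code): the intermediate temporary files are never deleted, so a long-running process would slowly fill the temp directory. A small best-effort cleanup helper along these lines, assuming the merged output has already been consumed, would address that:

```java
import java.io.File;

public class TempFileCleanup {
    // Best-effort deletion of intermediate temp files once the
    // concatenated output has been consumed. Files that cannot be
    // deleted immediately (e.g. still locked) are scheduled for
    // deletion when the JVM exits.
    public static void cleanUp(File[] tempFiles) {
        for (File f : tempFiles) {
            if (f != null && f.exists() && !f.delete()) {
                f.deleteOnExit();  // fall back if immediate delete fails
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("merged-is", ".pdf");
        cleanUp(new File[] { f });
        System.out.println(f.exists());  // prints false
    }
}
```

In the merge code above, this would be called with inputStreamFiles after concatenate returns, typically from a finally block so the files are removed even if the merge throws.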
Would this be your recommended approach? From what you said previously, I think we do not need to set incremental updates to true. I believe we originally got this code from an online sample.
Thanks.
@douglasdallas
Yes, you do not need to set the IncrementalUpdates flag in this case. Saving the files temporarily to a physical path instead of a memory stream is also a good way to reduce memory usage, as streams occupy a lot of memory, especially when working with a huge number of documents. However, please feel free to share your sample files with us if you face any further issues; we will log an investigation ticket in our issue tracking system to address it.