PdfFileEditor.concatenate() is extremely slow when using setCopyLogicalStructure(true).
We need this setting in order to retain accessibility tags.
The following code demonstrates the issue:
import com.aspose.pdf.facades.PdfFileEditor;

import java.io.File;
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;

void testMerge() {
    Instant start = Instant.now();
    final PdfFileEditor editor = new PdfFileEditor();
    editor.setCopyLogicalStructure(true);
    editor.setUseDiskBuffer(true);
    editor.setIncrementalUpdates(false);
    final File[] files = new File("/tmp/pdfs").listFiles((dir, name) -> name.toLowerCase().endsWith(".pdf"));
    final String[] fileNames = Arrays.stream(files).map(File::getAbsolutePath).toArray(String[]::new);
    final File mergedFile = new File("/tmp/merged.pdf");
    editor.concatenate(fileNames, mergedFile.getAbsolutePath());
    System.out.printf("Elapsed time: %dms%n", Duration.between(start, Instant.now()).toMillis());
}
With setCopyLogicalStructure(false), this completes in ~0.2s, but with setCopyLogicalStructure(true), it takes nearly 3 minutes.
I suspect the larger files (0.2MB and 0.5MB) are somehow choking the concatenation process.
We will potentially be concatenating several hundred input files, so as it stands this process takes hours to run.
I’ve uploaded the input pdfs as an attachment.
pdfs.zip (511.2 KB)
Many thanks.
@richbromley
It seems that you are experiencing significant performance issues when using the PdfFileEditor.concatenate() method with the setCopyLogicalStructure(true) option enabled. This setting is essential for retaining accessibility tags, but it appears to be causing a considerable slowdown in the concatenation process.
Here are a few suggestions to potentially improve the performance:
- Use Memory Streams: Instead of reading from disk files directly, consider using MemoryStream to load your PDF files into memory. This can reduce the overhead associated with file I/O operations. Here’s an example of how to implement this:
using System.IO;
using Aspose.Pdf.Facades;

public void ConcatenatePDFs(string[] fileNames, string outputFilePath)
{
    using (MemoryStream outputStream = new MemoryStream())
    {
        PdfFileEditor pdfEditor = new PdfFileEditor();
        // An array is not IDisposable, so it cannot go in a using statement;
        // dispose each stream individually in a finally block instead.
        MemoryStream[] inputStreams = new MemoryStream[fileNames.Length];
        try
        {
            for (int i = 0; i < fileNames.Length; i++)
            {
                inputStreams[i] = new MemoryStream(File.ReadAllBytes(fileNames[i]));
            }
            pdfEditor.Concatenate(inputStreams, outputStream);
        }
        finally
        {
            foreach (MemoryStream stream in inputStreams)
            {
                stream?.Dispose();
            }
        }
        File.WriteAllBytes(outputFilePath, outputStream.ToArray());
    }
}
- Batch Processing: If you are concatenating a large number of files, consider processing them in smaller batches. This can help manage memory usage and may improve overall performance.
- Profile Your Code: Use profiling tools to identify any bottlenecks in your code. This can help you understand where the most time is being spent during the concatenation process.
- Check File Sizes: As you mentioned, larger files may be contributing to the slowdown. If possible, try to optimize the size of the PDF files before concatenation.
- Update the Aspose Library: Ensure that you are using the latest version of the Aspose.PDF library, as performance improvements and bug fixes are regularly released.
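Since the project is on the Java version, the batch-processing suggestion can be sketched in Java. This is a minimal sketch under stated assumptions: planBatches is a hypothetical helper (not part of Aspose.PDF), and each batch, plus the temporary batch outputs in a final pass, would be fed to PdfFileEditor.concatenate(String[], String) exactly as in the original snippet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchPlanner {
    // Hypothetical helper: split the input paths into fixed-size batches.
    // Each batch would get its own PdfFileEditor.concatenate(batch, tmpOutput)
    // call, and the temporary outputs would then be concatenated in a final
    // pass, keeping the number of files handled per call bounded.
    public static List<String[]> planBatches(String[] inputs, int batchSize) {
        List<String[]> batches = new ArrayList<>();
        for (int i = 0; i < inputs.length; i += batchSize) {
            batches.add(Arrays.copyOfRange(inputs, i, Math.min(i + batchSize, inputs.length)));
        }
        return batches;
    }
}
```

Whether batching actually helps here depends on where the setCopyLogicalStructure(true) cost comes from; if the structure-tree merge is superlinear in the number of inputs per call, smaller batches should reduce the total time.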
If these suggestions do not resolve the issue, you may want to reach out to Aspose support for further assistance, as they can provide more tailored help based on your specific use case.
Sources:
[1]: PdfFileEditor.TryConcatenate | Aspose.PDF for .NET API Reference
[2]: ConcatenateMultiplePDFUsingMemoryStream.cs
We can’t use MemoryStream as suggested because we quickly end up with an OutOfMemoryException due to the number of files we’re processing.
@richbromley
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFJAVA-44731
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
Thanks for the reply - just to avoid any confusion, I’m working with the Java version rather than .NET.
@richbromley
Sure, we have updated the ticket ID and details accordingly.