When we split a document with more than 1,000 pages into individual single-page files and then attempt to merge those files back into a single multi-page PDF document, the following Out of Memory exception is thrown:
Oct 28, 2019 10:43:06 AM org.junit.platform.launcher.core.DefaultLauncher handleThrowable
WARNING: TestEngine with ID 'junit-jupiter' failed to execute tests
java.lang.OutOfMemoryError: Java heap space
at java.lang.Throwable.fillInStackTrace(Native Method)
at java.lang.Throwable.&lt;init&gt;(Throwable.java:88)
at java.lang.Throwable.&lt;init&gt;(Throwable.java:99)
at java.lang.Error.&lt;init&gt;(Error.java:70)
at java.lang.VirtualMachineError.&lt;init&gt;(VirtualMachineError.java:53)
at java.lang.OutOfMemoryError.&lt;init&gt;(OutOfMemoryError.java:58)
at java.lang.String.&lt;init&gt;(String.java:673)
at java.lang.String.&lt;init&gt;(String.java:608)
at com.aspose.pdf.internal.ms.System.l10l.lI(Unknown Source)
at com.aspose.pdf.internal.l8k.l0l.lf(Unknown Source)
at com.aspose.pdf.internal.l9j.lu.lI(Unknown Source)
at com.aspose.pdf.internal.l9j.ld.lI(Unknown Source)
at com.aspose.pdf.internal.l5y.l1l$lI.deserialize(Unknown Source)
at com.aspose.pdf.internal.l9u.le.deserialize(Unknown Source)
at com.aspose.pdf.internal.l5y.l1j$lI.deserialize(Unknown Source)
at com.aspose.pdf.internal.l9u.le.deserialize(Unknown Source)
at com.aspose.pdf.internal.l0k.lh.lI(Unknown Source)
at com.aspose.pdf.internal.l0k.lh.lI(Unknown Source)
at com.aspose.pdf.internal.l0k.lh.lI(Unknown Source)
at com.aspose.pdf.internal.l5y.l1j.l3y(Unknown Source)
at com.aspose.pdf.internal.l5y.l1j.l3v(Unknown Source)
at com.aspose.pdf.internal.l5y.l1j.l5l(Unknown Source)
at com.aspose.pdf.internal.l8k.l0v.lf(Unknown Source)
at com.aspose.pdf.internal.l0n.l0if.lt(Unknown Source)
at com.aspose.pdf.DocumentInfo.&lt;init&gt;(Unknown Source)
at com.aspose.pdf.ADocument.l1p(Unknown Source)
at com.aspose.pdf.ADocument.lI(Unknown Source)
at com.aspose.pdf.ADocument.&lt;init&gt;(Unknown Source)
at com.aspose.pdf.Document.&lt;init&gt;(Unknown Source)
at com.aspose.pdf.facades.APdfFileEditor.lI(Unknown Source)
at com.aspose.pdf.facades.APdfFileEditor.concatenate(Unknown Source)
at com.aspose.pdf.facades.PdfFileEditor.concatenate(Unknown Source)
at com.epiq.discovery.pdf.utils.PDFUtils.pdfMergeImageImagePDFBox(PDFUtils.java:68)
at com.epiq.discovery.pdf.utils.PDFUtilsTest.pdfMergeImageImagePDFBox(PDFUtilsTest.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:675)
at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:125)
Below is a snippet of the code we are using to attempt to accomplish this task:
PdfFileEditor pdfEditor = new PdfFileEditor();
pdfEditor.setIncrementalUpdates(true);
pdfEditor.setConcatenationPacketSize(100);
pdfEditor.concatenate(singlePageFilePathList.toArray(new String[0]), outputFilePath);
Would you kindly try using Aspose.PDF for Java 19.9 in your environment, and please also try increasing the Java heap size. In case you still face any issue, please share your sample PDF document along with the environment details (i.e. OS name and version, JDK version, Java heap size, and application type) with us. We will test the scenario in our environment and address it accordingly.
You can attach your sample files to the post using the Upload button. In case your files are larger than the allowed size, you may upload them to a public file-sharing service, e.g. Dropbox or Google Drive, and share the link with us.
Would you please confirm whether you are using the latest version of the API? Aspose.PDF for Java 19.10 has just been released, and we request that you try your scenario with it. Also, please concatenate the PDF files using the DOM approach, which is the recommended one. While splitting the PDF documents, you can use the Page.Dispose() method to free up captured memory.
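As a rough sketch of the DOM approach (the file names and the surrounding class are placeholders for illustration; the idea is that one Document's page collection is appended to another and the result is saved):

import com.aspose.pdf.Document;

public class DomConcatenationSketch {
    public static void main(String[] args) {
        // Placeholder paths for illustration only.
        Document target = new Document("part1.pdf");
        Document source = new Document("part2.pdf");
        try {
            // Append the source document's page collection to the target document (DOM approach).
            target.getPages().add(source.getPages());
            // Save the combined document.
            target.save("merged.pdf");
        } finally {
            // Release the memory held by both documents.
            source.close();
            target.close();
        }
    }
}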
In any case, if you still face a similar exception, we will need your complete code snippet along with a sample PDF file.
We regret that we do not offer any other medium for sharing the files. You can upload 600 MB of data to Google Drive, as it offers this space for free. Please let us know about your feedback.
I attempted to adjust the min and max heap allocated on my test runs and still received the same errors. I am including the samples I tested with and the project I used to test with.
We have logged a ticket with ID PDFJAVA-39021 in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.
The ticket has been closed with the following feedback: the document is large (600 MB) and has a lot of internal information to be unpacked. Processing it in one piece requires 7 GB of heap instead of the 4 GB configured in the code.
However, with the following three changes, 6 GB will be enough for processing:
Gradle config (on the test task):
minHeapSize = "6G"
maxHeapSize = "6G"
Add pdfDocument.close(); at the end of the method com.epiq.discovery.pdf.utils.PDFUtils#pdfSplitToSinglePageAspose.
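For illustration only, a rough shape of that change (the actual split logic is elided; the parameter names, return type, and the pdfDocument variable are assumptions about the existing code):

import com.aspose.pdf.Document;
import java.util.ArrayList;
import java.util.List;

public class SplitCloseSketch {
    // Sketch: whatever the split method does, close the source Document in a finally
    // block so the memory it captured is released even if splitting fails part-way.
    public static List<String> pdfSplitToSinglePageAspose(String inputFilePath, String outputDirectory) {
        List<String> singlePageFilePaths = new ArrayList<String>();
        Document pdfDocument = new Document(inputFilePath);
        try {
            // ... existing logic that saves each page to its own file and adds the path to the list ...
        } finally {
            pdfDocument.close();
        }
        return singlePageFilePaths;
    }
}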
We can also add a notifier to watch the concatenation progress and memory usage:
Change the following method (com.epiq.discovery.pdf.utils.PDFUtils#pdfMergeImageAspose):
public static void pdfMergeImageAspose(List<String> singlePageFilePathList, String outFilePath) throws Throwable {
    logger.info("pdfMergeImageAspose Start");
    File outFile = new File(outFilePath);
    if (outFile.exists()) {
        Files.delete(outFile.toPath());
    }
    // Any image files to process?
    if (singlePageFilePathList.size() <= 0) {
        logger.info("pdfMergeImageAspose - No images to process");
    } else {
        // Create PdfFileEditor object
        PdfFileEditor pdfEditor = new PdfFileEditor();
        pdfEditor.setOptimizeSize(true);
        pdfEditor.setIncrementalUpdates(true);
        pdfEditor.setConcatenationPacketSize(100);
        try {
            printMemoryStatus();
            pdfEditor.customProgressConcatenationHandler = new PdfFileEditor.ConcatenationProgressHandler() {
                @Override
                public void invoke(PdfFileEditor.ProgressEventHandlerInfo eventInfo) {
                    concatenationProgressEvent(eventInfo);
                }
            };
            pdfEditor.concatenate(singlePageFilePathList.toArray(new String[0]), outFilePath);
        } catch (Throwable t) {
            logger.info(String.format("pdfMergeImageAspose:: error merging documents %s",
                    String.join(" ", singlePageFilePathList.toArray(new String[0]))), t);
            throw t;
        }
    }
    if (!outFile.exists()) {
        throw new IOException("The expected merged pdf file '" + outFile.getAbsolutePath() + "' does not exist after merging.");
    }
    logger.info("pdfMergeImageAspose Done");
    printMemoryStatus();
}
private static int countPageConcatenated = 0;
private static int countDocumentConcatenated = 0;
private static int countBlankPageConcatenated = 0;
private void clearCounter() {
    countPageConcatenated = 0;
    countDocumentConcatenated = 0;
    countBlankPageConcatenated = 0;
}
static void concatenationProgressEvent(PdfFileEditor.ProgressEventHandlerInfo eventInfo) {
    switch (eventInfo.EventType) {
        case PdfFileEditor.ProgressEventType.PageConcatenated:
            logger.info("Page " + eventInfo.Value + "/" + eventInfo.MaxValue + " from document "
                    + eventInfo.DocumentNumber + " was concatenated");
            countPageConcatenated++;
            break;
        case PdfFileEditor.ProgressEventType.BlankPage:
            logger.info("A blank page was inserted instead of missing page " + eventInfo.Value
                    + " in document " + eventInfo.DocumentNumber + ".");
            countBlankPageConcatenated++;
            break;
        case PdfFileEditor.ProgressEventType.DocumentEmbeddedFiles:
            logger.info("Copying embedded files from document " + eventInfo.DocumentNumber + " was completed.");
            break;
        case PdfFileEditor.ProgressEventType.DocumentForms:
            logger.info("Copying document forms from document " + eventInfo.DocumentNumber + " was completed.");
            break;
        case PdfFileEditor.ProgressEventType.DocumentOutlines:
            logger.info("Copying document outlines from document " + eventInfo.DocumentNumber + " was completed.");
            break;
        case PdfFileEditor.ProgressEventType.DocumentJavaScript:
            logger.info("Copying JavaScript from document " + eventInfo.DocumentNumber + " was completed.");
            break;
        case PdfFileEditor.ProgressEventType.DocumentLogicalStructure:
            logger.info("Copying document logical structure from document " + eventInfo.DocumentNumber + " was completed.");
            break;
        case PdfFileEditor.ProgressEventType.DocumentConcated:
            logger.info("Document " + eventInfo.DocumentNumber + " with " + eventInfo.MaxValue
                    + " pages was completely concatenated.");
            printMemoryStatus();
            countDocumentConcatenated++;
            break;
        case PdfFileEditor.ProgressEventType.AllPagesCopied:
            logger.info("Copying all pages from document " + eventInfo.DocumentNumber + " was completed.");
            break;
        case PdfFileEditor.ProgressEventType.TotalPercentage:
            logger.info("Total progress percentage is: " + eventInfo.Value + "%.");
            break;
        default:
            break;
    }
}
public static void printMemoryStatus() {
    Runtime rt = Runtime.getRuntime();
    long max = rt.maxMemory() / 1048576;
    long total = rt.totalMemory() / 1048576;
    long free = rt.freeMemory() / 1048576;
    long used = total - free;
    long realFree = max - used;
    logger.info("** Memory status (max / used / free):\n\t" + max + " / " + used + " / " + realFree);
    logger.info("\t" + new Date() + "\n");
}
public static void printMemoryStatus_hard() {
    printMemoryStatus();
    // Force a GC run by attempting to allocate an impossibly large temporary array on the heap.
    int[] oomArray = null;
    try {
        logger.info("Cleaning memory...");
        MemoryCleaner.clearStaticInstances();
        oomArray = new int[Integer.MAX_VALUE - 4];
    } catch (OutOfMemoryError e) {
        // Do nothing
    } finally {
        oomArray = null;
    }
    printMemoryStatus();
}
We can also advise concatenating the PDF in parts, for example in portions of 50 pages each, and then aggregating the portions into the final document.
Change the following method (com.epiq.discovery.pdf.utils.PDFUtilsTest#pdfMergeImageAspose):
void pdfMergeImageAspose(String fileName, int pages) throws Throwable {
    ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
    URL resource = classLoader.getResource(fileName);
    String inputFile = resource.getPath();
    String tempDirectory = System.getProperty("java.io.tmpdir");
    String tempFileLocation = Paths.get(tempDirectory, "data").toString();
    List<String> imagePaths = PDFUtils.pdfSplitToSinglePageAspose(inputFile, tempFileLocation);
    long n = random.nextLong();
    String tempFile = Paths.get(tempDirectory, "data", "test" + n + ".pdf").toString();
    // PDFUtils.pdfMergeImageAspose(imagePaths, tempFile);
    int portion = 50;
    int processPages = 0;
    if (imagePaths.size() < portion) {
        logger.info("pdfMergeImageAspose entire document merge Start");
        PDFUtils.pdfMergeImageAspose(imagePaths, tempFile);
    } else {
        List<String> partiallyMergedFilePathList = new ArrayList<String>();
        for (int i = 0; i < imagePaths.size() / portion; i++) {
            logger.info("pdfMergeImageAspose portion " + i + " of document merge Start");
            List<String> singlePageFilePathListPortion =
                    imagePaths.subList(processPages, processPages += portion);
            String outFilePath = tempFile + "_portion_" + i + ".pdf";
            PDFUtils.pdfMergeImageAspose(singlePageFilePathListPortion, outFilePath);
            partiallyMergedFilePathList.add(outFilePath);
            logger.info("pdfMergeImageAspose portion " + i + " of document merge End");
        }
        // Merge whatever pages remain when the page count is not an exact multiple of the portion size.
        if (processPages < imagePaths.size()) {
            logger.info("pdfMergeImageAspose the last portion of document merge Start");
            List<String> singlePageFilePathListPortion =
                    imagePaths.subList(processPages, imagePaths.size());
            int lastPortion = partiallyMergedFilePathList.size() + 1;
            String outFilePath = tempFile + "_portion_" + lastPortion + ".pdf";
            PDFUtils.pdfMergeImageAspose(singlePageFilePathListPortion, outFilePath);
            partiallyMergedFilePathList.add(outFilePath);
            logger.info("pdfMergeImageAspose portion " + lastPortion + " of document merge End");
        }
        PDFUtils.printMemoryStatus_hard();
        logger.info("pdfMergeImageAspose aggregation of portions merge Start");
        PDFUtils.pdfMergeImageAspose(partiallyMergedFilePathList, tempFile);
        logger.info("pdfMergeImageAspose aggregation of portions merge End");
    }
    Document doc = new Document(tempFile);
    assertEquals(pages, doc.getPages().size());
    doc.close();
}
private static final SecureRandom random = new SecureRandom();