Convert PDF to DOCX in Java - Memory leak while converting files

Test_PDF_XY.pdf (68.2 KB)

Aspose Team,

We’re trying to convert color PDF files to grayscale in Java. Because there is a bug in the recommended solution (Grayscaling PDFs causes textboxes to get blacked-out - #4 by jmuth), we convert in a roundabout way: first PDF to DOCX, then DOCX to grayscale PDF. However, with each file or page converted, the program's memory usage increases slightly, and after a large number of files or pages the machine runs out of memory and the program crashes. The machine runs Ubuntu 18.04 with 4 GB of memory.

The following sample code reproduces the problem (I can upload the sample file later). Please advise whether any adjustment can be made to avoid the memory leak. Note that we tried the com.aspose.pdf.MemoryCleaner.clear() method, but it crashed our program, so we no longer use it. Also note that Aspose creates some temporary files that are not deleted until the program exits, so we need to delete them manually to avoid running out of disk space.

Thank you for your help.

Xiaohong

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConvertPdfGray {
public static void main(String args[]) {
    setSystemSettings();
    System.out.println("Start");
    String inputFile = "/home/ubuntu/convertpdfgray/data/Test_PDF_XY.pdf";
    String outputFile = "/home/ubuntu/convertpdfgray/data/Test_PDF_XY_gray.pdf";
    try {
        int nloops = 6000;
        if (args.length > 0) {
            nloops = Integer.parseInt(args[0]);
        }

        System.out.println("Input file: " + inputFile);
        System.out.println("Output file: " + outputFile);
        System.out.println("nloops: " + nloops);

        // Convert pdf file to grayscale
        for (int i=0; i<nloops; i++) {
            convertPdfToGrayscale(inputFile, outputFile);
            cleanupAsposeTemporaryFiles();
            int count = i+1;
            if(count % 100 == 0 || count == nloops) {
                System.out.println(String.format("Converted %d files", count));
            }
        }
    }
    catch(Throwable t) {
        t.printStackTrace();
    }
    finally {
        cleanupAsposeTemporaryFiles();
    }

    System.out.println("Done.");
    System.out.println("Press enter to exit");
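    // System.console() returns null when no interactive console is attached
    if (System.console() != null) {
        System.console().readLine();
    }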
}

public static void convertPdfToGrayscale(String pdfFile, String grayPdfPath) throws Exception {
    // Load the source pdf file
    com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfFile);

    // Save pdf to docx
    String tmpDocxPath = pdfFile.replace(".pdf", ".docx");
    pdfDocument.save(tmpDocxPath, com.aspose.pdf.SaveFormat.DocX);
    pdfDocument.close();

    // Load docx file
    com.aspose.words.Document wordDocument = new com.aspose.words.Document(tmpDocxPath);

    // Save docx back to pdf with Grayscale option
    com.aspose.words.PdfSaveOptions pdfSaveOptions = new com.aspose.words.PdfSaveOptions();
    pdfSaveOptions.setColorMode(com.aspose.words.ColorMode.GRAYSCALE);
    pdfSaveOptions.setSaveFormat(com.aspose.words.SaveFormat.PDF);
    pdfSaveOptions.setMemoryOptimization(true);
    wordDocument.save(grayPdfPath, pdfSaveOptions);
    // Note: cleanup() removes unused styles and lists from the document; it does not release memory
    wordDocument.cleanup();

    // Delete temporary docx file and output gray pdf file
    try {
        Files.deleteIfExists(Paths.get(tmpDocxPath));
        Files.deleteIfExists(Paths.get(grayPdfPath));
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public static void cleanupAsposeTemporaryFiles()
{
    //System.out.println("Clean up Aspose temporary files...");
    List<String> tmpFiles = Stream.of(new File("/tmp").listFiles())
            .filter(file -> !file.isDirectory() && file.getName().startsWith("aspose_"))
            .map(File::toString)
            .collect(Collectors.toList());
    cleanup(tmpFiles);
}

public static void cleanup(List<String> destFileList) {
    //System.out.println(String.format("cleanup %d files", destFileList.size()));
    for (String s : destFileList) {
        try {
            Files.deleteIfExists(Paths.get(s));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

private static void setSystemSettings() {
    try {
        String asposeLicense = "/home/ubuntu/QmulusWorker/required/Aspose.Total.Java.lic";
        new com.aspose.words.License().setLicense(asposeLicense);
        new com.aspose.pdf.License().setLicense(asposeLicense);
        System.out.println("License loaded successfully.");
    } catch (Exception e) {
        System.out.println("Failed to load the license: " + e.getMessage());
        System.exit(-1);
    }
}

}
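
The cleanupAsposeTemporaryFiles() helper above hard-codes /tmp and would throw a NullPointerException if listFiles() returned null (for example, when the directory cannot be read). A more defensive variant might look like the sketch below. It uses only the standard library; the class name TempFileSweeper and the method deleteByPrefix are hypothetical names chosen for illustration, and the "aspose_" prefix is taken from the code above rather than from any documented Aspose behavior.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of any Aspose API.
final class TempFileSweeper {

    /**
     * Deletes regular files directly inside dir whose names start with prefix.
     * Subdirectories are left alone. Returns the number of files deleted.
     */
    static int deleteByPrefix(Path dir, String prefix) {
        if (!Files.isDirectory(dir)) {
            return 0; // nothing to do if the directory is missing
        }
        int deleted = 0;
        // The filter keeps only regular files with the requested name prefix.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir,
                p -> Files.isRegularFile(p)
                        && p.getFileName().toString().startsWith(prefix))) {
            for (Path p : stream) {
                try {
                    if (Files.deleteIfExists(p)) {
                        deleted++;
                    }
                } catch (IOException e) {
                    // A file may still be held open; report it and keep going.
                    System.err.println("Could not delete " + p + ": " + e.getMessage());
                }
            }
        } catch (IOException e) {
            System.err.println("Could not list " + dir + ": " + e.getMessage());
        }
        return deleted;
    }
}
```

In the loop above, the call would be something like TempFileSweeper.deleteByPrefix(Paths.get(System.getProperty("java.io.tmpdir")), "aspose_"), which follows the JVM's configured temporary directory instead of assuming /tmp.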

@xyang,

Can you please share the source file with us so that we can investigate further and help you out?

Uploaded the sample file (Test_PDF_XY.pdf).

@xyang,

Can you please share your complete environment details, along with the version of Aspose.PDF you are using?

@Adnan.Ahmad
Here are some details:

~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic

~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3885        2478         205           1        1201        1190
Swap:             0           0           0

~$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

Aspose.PDF and Aspose.Words versions are 19.11:
aspose-pdf-19.11.jar
aspose-words-19.11-jdk17.jar

Let me know if you need more information.

Thanks
Xiaohong

@xyang,

Can you please try the latest version of Aspose.PDF on your end and share your feedback with us?

I tested with Aspose.PDF 19.12 and Aspose.Words 19.12 and the problem still happens.
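
When reproducing growth like this, it can help to log the JVM's own heap usage alongside the process memory reported by the operating system. If the heap stays flat while the process keeps growing, the growth is likely outside the Java heap. The sketch below is one way to do that using only java.lang.Runtime; the class name HeapProbe and its methods are hypothetical, and System.gc() is only a hint that the JVM is free to ignore, so the numbers are approximate.

```java
import java.util.Locale;

// Hypothetical diagnostic helper for illustration; not part of any Aspose API.
final class HeapProbe {

    /** Returns the bytes of Java heap currently in use (allocated minus free). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Requests a garbage collection (a hint only), then returns the used heap. */
    static long usedHeapBytesAfterGc() {
        System.gc();
        return usedHeapBytes();
    }

    /** Formats a byte count as mebibytes with one decimal place. */
    static String formatMiB(long bytes) {
        return String.format(Locale.ROOT, "%.1f MiB", bytes / (1024.0 * 1024.0));
    }
}
```

A call such as System.out.println("Heap in use: " + HeapProbe.formatMiB(HeapProbe.usedHeapBytesAfterGc())) could be placed next to the existing "Converted %d files" progress message, so the heap figure is printed every 100 iterations.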

@xyang,

I have observed your issue and would like to inform you that I have created an investigation ticket with ID PDFJAVA-39081 in our issue tracking system to investigate and resolve this issue as soon as possible.

Has this issue been resolved?

@aweech

We are afraid that the issue has not been resolved yet. We have escalated it by logging your concerns and will inform you as soon as we make progress towards its resolution. We are sorry for the inconvenience.