PDF to Image Java using Aspose.PDF - OutOfMemoryException

UpperVolta · February 19, 2021, 7:04pm

I have numerous PDF files that cause OOM exceptions. I am trying to generate images from the PDF using the following code:

public static void main( String[] args ) {

    String fileName = "test.pdf";
    if ( args.length > 0 ) {
        fileName = args[0];
    }
    File pdfFile = new File( fileName );
    Document doc = null;
    InputStream in = null;
    try {
        in = new FileInputStream( pdfFile );
        doc = new Document( in );
        com.aspose.pdf.facades.PdfConverter converter = new com.aspose.pdf.facades.PdfConverter();
        converter.bindPdf( doc );

        converter.setStartPage( 1 );
        //converter.setEndPage( doc.getPages().size() );
        int pageNum = 1;
        while ( converter.hasNextImage() ) {

            System.out.println( "Generating slides for page " + pageNum );
            Page page = doc.getPages().get_Item( pageNum );
            System.out.println( "Got Page element for page " + pageNum );
            String fullFileName = "full_" + pageNum + ".png";
            OutputStream fullStream = new FileOutputStream( fullFileName );
            System.out.println( "Calling getNextImage" );
            converter.getNextImage( fullStream, ImageType.getPng() ); //jpg", ImageType.getJpeg() , 100, 150, 100);
            System.out.println( "Returned from calling getNextImage" );
            pageNum++;
        }
    } catch ( FileNotFoundException e ) {
        e.printStackTrace();
    }
}

The Java version is 1.8. The environments are Linux (64-bit) and Windows Server 2016. I have tried raising the -Xmx setting to 6GB, which helped in some cases, but not enough.
Strangely, I have a Mac where this runs without any additional settings and is able to convert the PDFs. It is also running Java 1.8.
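
Since the same code behaves differently on the Mac and on the Linux and Windows machines, one thing worth ruling out is that the JVMs are running with different effective heap limits. The following is a minimal sketch using only the standard Runtime API; the class name HeapReport and the helper toMiB are illustrative and not part of the original program.

```java
public class HeapReport {

    // Converts a byte count to whole mebibytes (rounding down).
    static long toMiB( long bytes ) {
        return bytes / ( 1024L * 1024L );
    }

    // Builds a one-line summary of the heap figures the running JVM reports.
    static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx, or the platform default when -Xmx is absent.
        return "max=" + toMiB( rt.maxMemory() ) + " MiB"
                + ", committed=" + toMiB( rt.totalMemory() ) + " MiB"
                + ", free=" + toMiB( rt.freeMemory() ) + " MiB";
    }

    public static void main( String[] args ) {
        System.out.println( describeHeap() );
    }
}
```

Running this with the same launch options on each machine shows whether the 6GB setting is actually in effect there and what default the Mac picks up when no setting is given.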

asad.ali · February 20, 2021, 7:01pm

@UpperVolta

Would you please make sure that you are using the latest version of the API? If the issue persists, please let us know whether it occurs only with certain large PDF files or with any PDF file. Please also share some sample PDF documents with us so that we can test the scenario in our environment and address it accordingly.

UpperVolta · February 20, 2021, 11:18pm

I am using version 21.1. If you have something more recent, I can try it.

I am having problems like this with many PDF files. All of them seem to take an extraordinary amount of memory and CPU to process. I am running a 6GB heap and this has allowed me to process most of the files under 15MB in size, but I’m still running out with some that are around 20 MB.

I can send some samples, but they belong to our customer, can you keep them private?

I am also open to using the API differently if you think that would help. What we are trying to do is generate images of each page in the PDF. Sometimes this is just one page. We’re doing so using a dedicated thread pool so we only have one of these per JVM running at a time, although concurrency isn’t related to our problems here.
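
The "one conversion at a time per JVM" arrangement described above can be sketched with a single-thread executor from java.util.concurrent. This is only an illustration under that assumption: the class SerialConversionQueue, its method names, and the placeholder jobs are hypothetical and are not the poster's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SerialConversionQueue {

    // A single worker thread guarantees that at most one job runs at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Queues a job; jobs run one after another in submission order.
    Future<?> submit( Runnable conversionJob ) {
        return worker.submit( conversionJob );
    }

    // Stops accepting jobs and waits for the queued ones; returns false on timeout or interrupt.
    boolean shutdownAndWait( long timeoutSeconds ) {
        worker.shutdown();
        try {
            return worker.awaitTermination( timeoutSeconds, TimeUnit.SECONDS );
        } catch ( InterruptedException e ) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main( String[] args ) {
        SerialConversionQueue queue = new SerialConversionQueue();
        List<String> order = new ArrayList<>();
        for ( String name : new String[] { "a.pdf", "b.pdf", "c.pdf" } ) {
            // Placeholder job: a real one would call the PDF-to-image code.
            queue.submit( () -> order.add( name ) );
        }
        queue.shutdownAndWait( 10 );
        System.out.println( order ); // prints "[a.pdf, b.pdf, c.pdf]"
    }
}
```

With this shape, peak memory is bounded by the most expensive single document rather than by the sum of several concurrent conversions.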

asad.ali · February 21, 2021, 6:27pm

@UpperVolta

Please try the DOM approach to convert PDF to image using Aspose.PDF for Java. If you face any issue with that approach, please share a sample PDF file with us. We assure you that we use the files only for investigation and erase them from our system once the investigation is complete. We also do not disclose your files to anyone.

Kurt_Mehlhoff · February 22, 2021, 4:00pm

I can try that. When declaring the rectangle I would like the entire page. How do I know the dimensions of the page?

How should I share the file with you? It is too large for email.

asad.ali · February 22, 2021, 10:29pm

@UpperVolta

You do not need to specify a rectangle when converting a PDF page to an image using the DOM approach. However, if you need the page dimensions, you can use the Page.getRect() method.
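
Page dimensions in a PDF are expressed in points (1/72 inch by default), so the pixel size of a rendered page depends on the resolution passed to the image device. The rough sketch below shows that arithmetic, assuming the resolution is given in DPI and the output is an uncompressed raster at 4 bytes per pixel. The class and method names are hypothetical, and the figure is only the size of the finished bitmap; how much memory Aspose.PDF needs while rendering is not something this thread documents.

```java
public class RasterEstimate {

    // PDF user space defaults to 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Pixel length of one page side at the given DPI, rounded up.
    static long pixels( double points, int dpi ) {
        return (long) Math.ceil( points / POINTS_PER_INCH * dpi );
    }

    // Bytes for an uncompressed raster; bytesPerPixel is an assumption (4 for ARGB).
    static long rasterBytes( double widthPt, double heightPt, int dpi, int bytesPerPixel ) {
        return pixels( widthPt, dpi ) * pixels( heightPt, dpi ) * bytesPerPixel;
    }

    public static void main( String[] args ) {
        // US Letter is 612 x 792 points; 100 DPI matches new Resolution( 100 ).
        double widthPt = 612, heightPt = 792;
        System.out.println( pixels( widthPt, 100 ) + " x " + pixels( heightPt, 100 ) + " px, "
                + rasterBytes( widthPt, heightPt, 100, 4 ) + " bytes" );
        // prints "850 x 1100 px, 3740000 bytes"
    }
}
```

In real code the width and height would come from the rectangle returned by Page.getRect(). For a Letter-sized page at 100 DPI the finished bitmap is under 4 MB, so the bitmap alone would not account for a multi-gigabyte heap.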

UpperVolta · February 24, 2021, 7:02pm

I did try that, but it had no effect on memory consumption.

asad.ali · February 24, 2021, 9:50pm

@UpperVolta

Could you please share your sample input file along with the code snippet that you are using? We will test the scenario in our environment and address it accordingly.

Kurt_Mehlhoff · February 24, 2021, 10:19pm

How? The file size is more than the limit you allow to upload.

Here is the code:

import com.aspose.pdf.*;
import com.aspose.pdf.devices.BmpDevice;
import com.aspose.pdf.devices.Resolution;

import javax.print.Doc;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.Date;

public class PDFConvert {

    static License license = null;

    static {
        System.setProperty( "java.awt.headless", "true" );

        // Here's how to read the license file in, according to Aspose.
        if ( license == null ) {
            license = new License();

            InputStream fstream = null;
            try {
                fstream = getClassPathResourceAsStream( "Aspose.Total.Java.lic" );
                if ( fstream == null ) {
                    log( "Unable to read license file: " + "Aspose.Total.Java.lic" );
                }

                license.setLicense( fstream );

            } catch ( Exception ex ) {
                System.out.println( ex );
            } finally {
                try {
                    if ( fstream != null )
                        fstream.close();
                } catch ( IOException ioe ) {
                    System.out.println( ioe );
                }
            }
        }
    }

    public static InputStream getClassPathResourceAsStream( String fileName ) {
        InputStream in = PDFConvert.class.getClassLoader().getResourceAsStream( fileName );
        if ( in == null ) {
            //Try to load it with prepending slash
            log( "Could not find" + fileName + " trying to find it by prepending slash." );
            in = PDFConvert.class.getClassLoader().getResourceAsStream( "/" + fileName );
        }
        return in;
    }

    public static void generateSlidesConverter( String fileName ) {
        File pdfFile = new File( fileName );
        Document doc = null;
        InputStream in = null;
        log( "Starting generate slides via converter" );
        try {
            in = new FileInputStream( pdfFile );
            doc = new Document( in );
            com.aspose.pdf.facades.PdfConverter converter = new com.aspose.pdf.facades.PdfConverter();
            converter.bindPdf( doc );

            converter.setStartPage( 1 );
            //converter.setEndPage( doc.getPages().size() );
            int pageNum = 1;
            while ( converter.hasNextImage() ) {

                log( "Generating slides for page " + pageNum );
                Page page = doc.getPages().get_Item( pageNum );
                log( "Got Page element for page " + pageNum );
                String fullFileName = "full_" + pageNum + ".png";
                OutputStream fullStream = new FileOutputStream( fullFileName );
                log( "Calling getNextImage" );
                converter.getNextImage( fullStream, ImageType.getPng() ); //jpg", ImageType.getJpeg() , 100, 150, 100);
                log( "Returned from calling getNextImage" );
                pageNum++;
            }
        } catch ( FileNotFoundException e ) {
            e.printStackTrace();
        }
        log( "Done generating slides vis converter" );
    }

    public static void main( String[] args ) {

        String fileName = "test.pdf";
        if ( args.length > 0 ) {
            fileName = args[0];
        }

        generateSlidesConverter( fileName );
        generateSlidesDOM( fileName );
    }

    private static void generateSlidesDOM( String pdfFile ) {
        log( "Starting generate slides via DOM" );
        Document document = new Document( pdfFile );
        Rectangle pageRect = document.getPages().get_Item( 1 ).getRect();
        log( "Have Page Rect" );
        // Get rectangle of particular page region
        //Rectangle pageRect = new Rectangle( 20, 671, 693, 1125 );
        // set CropBox value as per rectangle of desired page region
        document.getPages().get_Item( 1 ).setCropBox( pageRect );
        // save cropped document into stream
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        document.save( outStream );
        log( "Saved doc, creating new one" );
        // open cropped PDF document from stream and convert to image
        document = new Document( new ByteArrayInputStream( outStream.toByteArray() ) );
        // Create Resolution object - I have no idea what this does
        Resolution resolution = new Resolution( 100 );
        // Create BMP device with specified attributes
        BmpDevice bmpDevice = new BmpDevice( resolution );
        // Convert a particular page and save the image to stream
        log( "Processesing device" );
        bmpDevice.process( document.getPages().get_Item( 1 ), "Output.bmp" );
        log( "Saved image - done with generate slides" );
    }

    public static void log( String msg ) {
        SimpleDateFormat sdf = new SimpleDateFormat( "hh.mm.ss" );
        String ts = sdf.format( new Date() );
        System.out.println( ts + "\t" + msg );
    }
}

asad.ali · February 25, 2021, 5:08am

@Kurt_Mehlhoff

Please upload the sample file to Google Drive or Dropbox and share the link with us.

Kurt_Mehlhoff · March 1, 2021, 9:01pm

Here is one of the PDF files which cause problems.
https://www.dropbox.com/s/k2c55onccmj0ctv/727.pdf?dl=0

asad.ali · March 2, 2021, 5:39pm

@UpperVolta

We were able to reproduce the issue in our environment while using Aspose.PDF for Java 21.2 and the following code snippet:

Document pdfDocument = new Document(dataDir + "727.pdf");
for (Page page : pdfDocument.getPages()) {
    java.io.OutputStream imageStream = new java.io.FileOutputStream(dataDir + "Converted_Image_" + page.getNumber() + ".png");
    com.aspose.pdf.devices.Resolution resolution = new com.aspose.pdf.devices.Resolution(100);
    // Use PngDevice so the image data matches the .png file extension
    com.aspose.pdf.devices.PngDevice pngDevice = new com.aspose.pdf.devices.PngDevice(resolution);
    pngDevice.process(page, imageStream);
    imageStream.close();
}

Therefore, we have logged it as PDFJAVA-40232 in our issue tracking system. We will further look into its details and keep you posted with the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

asad.ali · July 9, 2021, 6:50pm

@UpperVolta

We tried to reproduce the issue and got the same results in all the environments you mentioned (macOS 11.4, Windows 10 Pro, and Linux in Docker). Every run of this code snippet used the option -Xmx4G, and every run completed successfully. Please provide additional details so that we can reproduce the problem.

The issue has been verified, and the OOM does not reproduce with either a 6GB or a 4GB heap. However, the conversion took 5 minutes with 6GB and 19 minutes with 4GB. The document is very complex and contains a large number of objects to process. This is expected behavior: when memory is short, the garbage collector spends a lot of time releasing unused instances.
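
One way to check whether a slow run is dominated by garbage collection is to compare wall-clock time with the accumulated collection time reported by the standard JMX beans. The following is a minimal sketch using only the JDK; the class GcTimeProbe and its placeholder workload are illustrative, and in practice the measured task would wrap the page conversion call.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {

    // Sums accumulated collection time (ms) across all collectors.
    // A collector that does not support the metric reports -1 and is skipped.
    static long totalGcMillis() {
        long total = 0;
        for ( GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans() ) {
            long t = gc.getCollectionTime();
            if ( t > 0 ) {
                total += t;
            }
        }
        return total;
    }

    // Runs a task and returns { wall-clock ms, GC ms } spent while it ran.
    static long[] measure( Runnable task ) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMs = ( System.nanoTime() - start ) / 1_000_000L;
        return new long[] { wallMs, totalGcMillis() - gcBefore };
    }

    public static void main( String[] args ) {
        // Placeholder workload that only allocates short-lived arrays.
        long[] r = measure( () -> {
            for ( int i = 0; i < 200_000; i++ ) {
                byte[] junk = new byte[1024];
            }
        } );
        System.out.println( "wall=" + r[0] + " ms, gc=" + r[1] + " ms" );
    }
}
```

If the GC figure is a large share of the wall-clock figure at 4GB and a small share at 6GB, that would be consistent with the explanation above. The -verbose:gc JVM option gives a similar picture without code changes.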

Kurt_Mehlhoff · January 24, 2022, 11:42pm

Yes. The PDF belongs to a customer and is not for public use. Should I email it to you?

I can either post the source or email that as well.

asad.ali · January 25, 2022, 3:54pm

@UpperVolta

We believe that your message was about the other query which you posted recently i.e. PDF to Image Conversion results in endless memory consumption. We are sending you a private message and you can share your file in reply to that message so that we will proceed to assist you accordingly.

Kurt_Mehlhoff · January 25, 2022, 4:07pm

Already done

asad.ali · January 25, 2022, 7:04pm

@UpperVolta

We were able to reproduce the issue in our environment while testing the scenario with Aspose.PDF for Java 22.1. Therefore, this has been logged as PDFJAVA-41255 in our issue tracking system. We will further look into its details and keep you posted with the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

UpperVolta · April 18, 2022, 3:18pm

Any progress on this issue?

asad.ali · April 18, 2022, 8:57pm

@UpperVolta

Regretfully, the earlier logged ticket has not been resolved yet because of other issues ahead of it in the queue. Issues are fixed on a first-come, first-served basis, and we will notify you via this forum thread as soon as we have an update on the ticket resolution. Please spare us some time.

We apologize for the inconvenience.

UpperVolta · January 30, 2023, 10:48pm

Any progress on this issue? Is your inaction an indication that you have no plans to address the issue?