Aspose.PDF Java 27.3:jdk17 JpegDevice.process is really slow and so is loading a document

I am running some performance tests to compare Aspose to another PDF library that we use, and I have found Aspose to be much slower at opening documents and saving them as images. Here is the test I am using to time loading a simple 1-page PDF and converting it to JPEG and TIFF. Compared to the library we are trying to replace, this is 3-4x slower, which is not acceptable. I have tried using streams, optimizing the document, and different resolution/quality settings, but none of these made much difference. Are these the expected times for these operations? Any information or help would be appreciated.

@Test
public void testAsposeLoadDocSaveImage() {
    long startTime = System.nanoTime();
    Document pdfDocument = new Document("src/test/resources/simple.pdf");
    long endTime = System.nanoTime();
    long duration = TimeUnit.NANOSECONDS.toMillis(endTime - startTime);
    System.out.println("Time taken to open the doc: " + duration + " ms");

//        pdfDocument.optimize();
//        OptimizationOptions optimizationOptions = new OptimizationOptions();
//        optimizationOptions.setImageEncoding(ImageEncoding.Flate);
//        optimizationOptions.setMaxResoultion(300);
//        pdfDocument.optimizeResources(optimizationOptions);
    Page pdfDocumentPage = pdfDocument.getPages().get_Item(1);
    Resolution resolution = new Resolution(300);
    JpegDevice jpegDevice = new JpegDevice(resolution, 80);

    startTime = System.nanoTime();
    jpegDevice.process(pdfDocumentPage, "simple.jpeg");
    endTime = System.nanoTime();
    duration = TimeUnit.NANOSECONDS.toMillis(endTime - startTime);
    System.out.println("Time taken to process the jpeg: " + duration + " ms");

    File file = new File("simple.jpeg");
    assertTrue(file.isFile());
    file.delete();

    TiffSettings tiffSettings = new TiffSettings();
    tiffSettings.setSkipBlankPages(true);
    tiffSettings.setCompression(CompressionType.CCITT4);
    TiffDevice tiffDevice = new TiffDevice(resolution, tiffSettings);

    startTime = System.nanoTime();
    tiffDevice.process(pdfDocument, 1, 1, "simple.tif");
    endTime = System.nanoTime();
    duration = TimeUnit.NANOSECONDS.toMillis(endTime - startTime);
    System.out.println("Time taken to process the tiff: " + duration + " ms");

    file = new File("simple.tif");
    assertTrue(file.isFile());
    file.delete();
}

Time taken to open the doc: 1226 ms
Time taken to process the jpeg: 3724 ms
Time taken to process the tiff: 1236 ms

I am using a simple PDF that contains only the text “This is a test”.

@ehenry120

The performance of the API is measured over multiple runs and in release mode. On first load and execution, the API loads the necessary resources into memory, which causes some delay in processing. Subsequent executions are faster. Have you tried running the same program multiple times to check how long each run takes?
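To separate the one-time load cost from steady-state speed, a test can time the first (cold) run on its own and summarize the later (warm) runs. The following is only a minimal sketch: the class name WarmupBenchmark and the placeholder workload are hypothetical, and in the test above the Runnable would instead wrap the calls that open the document and invoke jpegDevice.process(...).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Aspose: times a task several times and
// reports the first (cold) run separately from the later (warm) runs.
class WarmupBenchmark {

    /** Timing summary for one benchmark. */
    static final class Result {
        public final long coldMillis;        // duration of the very first run
        public final long warmMedianMillis;  // upper median of all later runs
        public final int runs;               // total number of runs, cold included

        Result(long coldMillis, long warmMedianMillis, int runs) {
            this.coldMillis = coldMillis;
            this.warmMedianMillis = warmMedianMillis;
            this.runs = runs;
        }
    }

    static Result measure(Runnable task, int runs) {
        if (runs < 2) {
            throw new IllegalArgumentException("need at least 2 runs: one cold, one warm");
        }
        long cold = 0;
        List<Long> warm = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            if (i == 0) {
                cold = elapsed;   // includes class loading and other one-time setup
            } else {
                warm.add(elapsed);
            }
        }
        Collections.sort(warm);
        long median = warm.get(warm.size() / 2);
        return new Result(cold, median, runs);
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption); replace with the PDF-to-image code under test.
        Runnable placeholder = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unexpected empty result");
            }
        };
        Result r = measure(placeholder, 10);
        System.out.println("cold run: " + r.coldMillis + " ms, warm median: "
                + r.warmMedianMillis + " ms over " + (r.runs - 1) + " warm runs");
    }
}
```

Reporting the cold run and the warm median separately makes it easier to tell whether a slow result comes from start-up or from the conversion itself.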

@asad.ali Thank you for your response. I suspected some lazy loading, so I ran the test repeatedly and got much better results:

Time taken to process the jpeg0: 3500 ms
Time taken to process the tiff0: 735 ms
Time taken to open the doc 1: 25 ms
Time taken to process the jpeg1: 554 ms
Time taken to process the tiff1: 726 ms
Time taken to open the doc 2: 23 ms
Time taken to process the jpeg2: 436 ms
Time taken to process the tiff2: 466 ms
Time taken to open the doc 3: 15 ms
Time taken to process the jpeg3: 439 ms
Time taken to process the tiff3: 442 ms
Time taken to open the doc 4: 18 ms
Time taken to process the jpeg4: 429 ms
Time taken to process the tiff4: 371 ms
Time taken to open the doc 5: 11 ms
Time taken to process the jpeg5: 501 ms
Time taken to process the tiff5: 347 ms
Time taken to open the doc 6: 10 ms
Time taken to process the jpeg6: 431 ms
Time taken to process the tiff6: 346 ms
Time taken to open the doc 7: 16 ms
Time taken to process the jpeg7: 384 ms
Time taken to process the tiff7: 327 ms
Time taken to open the doc 8: 19 ms
Time taken to process the jpeg8: 389 ms
Time taken to process the tiff8: 315 ms
Time taken to open the doc 9: 9 ms
Time taken to process the jpeg9: 393 ms
Time taken to process the tiff9: 312 ms

The problem we are facing is that we run this on AWS Lambda, and we split each document into pages in an attempt to go wide with parallel processing. Since Aspose needs to be primed before it can perform at its best, and we are only processing 1-page docs, every invocation takes a hit from the overhead of loading Aspose. We have discussed splitting a multi-page doc into batches of smaller multi-page docs instead of single pages, so that each invocation processes more than 1 page. The problem is that 50% of our documents are 1 page, so we cannot easily batch those into one invocation at this time. Another option we discussed is using a service instead of Lambda, so that Aspose is always loaded and ready. Do you have any other suggestions to get around this?
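The batching idea for multi-page documents could be sketched roughly as follows. This is an illustration only: PageBatcher, Range and the batch size are hypothetical names and values, not part of Aspose. Each resulting range could then be passed to a call such as tiffDevice.process(pdfDocument, from, to, outputPath), as in the test above, so that one invocation pays the load cost once for several pages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a page count into inclusive, 1-based page ranges.
class PageBatcher {

    /** Inclusive 1-based page range, for example pages 5 to 8. */
    static final class Range {
        final int from;
        final int to;

        Range(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return from + "-" + to;
        }
    }

    static List<Range> split(int pageCount, int batchSize) {
        if (pageCount < 0 || batchSize < 1) {
            throw new IllegalArgumentException("pageCount must be >= 0 and batchSize >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        // long loop variable avoids int overflow for very large batch sizes
        for (long from = 1; from <= pageCount; from += batchSize) {
            int to = (int) Math.min(from + batchSize - 1, pageCount);
            ranges.add(new Range((int) from, to));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-page document in batches of 4 pages per invocation
        System.out.println(split(10, 4)); // prints [1-4, 5-8, 9-10]
    }
}
```

A 1-page document still yields a single range, so this does nothing for the 50% of documents mentioned above; those would need a different approach.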

@ehenry120

Have you considered using the Aspose.PDF Cloud API? It would also make the Lambda deployment package lighter. You can try it and evaluate its performance. If you are not interested in using the Cloud API, please let us know, and we will analyze this scenario by logging a ticket in our issue management system.

I looked into the Cloud API, and we would need the self-hosted version if anything. There would be a high volume of requests, and it needs to interact with S3 to get files. We looked at a few ways to optimize the Lambda and keep the Aspose objects alive between invocations. Instead of creating a new JpegDevice or TiffDevice each time a request is processed, I now create them once globally and reuse them, but we still cannot get the time to process a 1-page, text-only doc under 2 seconds. It should be around 1 second: when I run multiple pages, I get close to 1 second on the subsequent pages, as in the results below:

Time taken to open the doc 0: 1624 ms
Time taken to process the tiff0: 919 ms
Time taken to open the doc 1: 55 ms
Time taken to process the tiff1: 796 ms
Time taken to open the doc 2: 14 ms
Time taken to process the tiff2: 532 ms

With the Lambda we never get those times because Aspose has to load on each invocation, and we pay that overhead every time. If we ran the self-hosted Cloud API and called it for everything, we would avoid the overhead, but that is a major architectural change. Also, is there any risk of cross-contaminating data when reusing the Document object or the JpegDevice or TiffDevice? If I create the Document object globally and re-initialize it each time I read a doc, is there any risk of data being kept between document loads?

@ehenry120

If you re-initialize a globally declared Document object, you can be sure that it will not mix in content from the previously loaded document. The Document.Save() method also works like Dispose(): it clears from memory all resources that were loaded for that particular file. Therefore, re-initializing the Document object won’t do any harm.

As for TiffDevice and the other device classes, we believe they do not take up many resources when you initialize a new object. However, we would need to do some analysis on our end before we can answer definitively. Could you provide some sample files and a code snippet so that we can work with them and observe the issue you are seeing? We will then proceed accordingly.
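For reference, the reuse pattern discussed in this thread (create the expensive objects once, keep per-request data local) might look roughly like the sketch below. It deliberately avoids the Aspose API: Renderer is a hypothetical stand-in for a device object, and the warm-up call is an assumption about what would prime the library. In AWS Lambda, static initialization runs once per execution environment, so this only helps invocations that land on an already-initialized environment; a cold start still pays the full cost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical handler sketch: one shared renderer per JVM, no shared request state.
class WarmHandler {

    /** Hypothetical stand-in for an expensive, reusable renderer such as a device object. */
    interface Renderer {
        String render(String documentName);
    }

    // Counts how often the expensive setup ran (for demonstration only).
    // Declared before RENDERER so it is initialized first.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Created once when the class is initialized, then shared by all requests in this JVM.
    private static final Renderer RENDERER = createAndWarmUp();

    private static Renderer createAndWarmUp() {
        INIT_COUNT.incrementAndGet();
        Renderer renderer = name -> name + ".tif"; // placeholder for real conversion work
        renderer.render("warmup");                 // throwaway call to prime the code path
        return renderer;
    }

    /** Per-request entry point: all request data lives in parameters and locals. */
    static String handle(String documentName) {
        return RENDERER.render(documentName);
    }

    public static void main(String[] args) {
        System.out.println(handle("a"));      // prints a.tif
        System.out.println(handle("b"));      // prints b.tif
        System.out.println(INIT_COUNT.get()); // prints 1
    }
}
```

Keeping the document itself out of static fields, and loading a fresh one inside handle, is one way to sidestep the cross-contamination question entirely.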