Determine number of pages without loading entire document into heap

aschuch · July 31, 2024, 1:22pm

Hi,

I would like to retrieve the page count of large PDF documents (tens of thousands of pages) with a minimal memory footprint, i.e. without loading the entire document into the heap.

Using aspose-pdf version 24.7, I tried the following three approaches, but all of them occupy a lot of heap memory. Therefore, I assume that there are better (i.e. less heap memory consuming) ways of doing that.

Approach #1:
Document document = new Document("C:\\foobar.pdf");
int numberPages = document.getPages().size();
document.close();

Approach #2:
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("C:\\foobar.pdf");
int numberPages = extractor.getEndPage();
extractor.close();

Approach #3:
PdfFileInfo info = new PdfFileInfo();
info.bindPdf("C:\\foobar.pdf");
int numberPages = info.getNumberOfPages();
info.close();

Note:
According to the release notes of aspose-pdf version 21.2, a new feature was added in scope of PDFJAVA-37350 (Check document properties without loading entire document).
Also see release notes: Aspose.PDF for Java 21.2 Release Notes
However, I’m not able to find any information about how to use this new feature.

Best regards,
Alex

Professionalize.Discourse · July 31, 2024, 1:22pm

To retrieve the page count of large PDF documents with minimal memory usage, you can utilize the feature introduced in Aspose.PDF version 21.2, which allows you to check document properties without loading the entire document into memory. This is particularly useful for handling large files efficiently.

Here’s a code snippet that demonstrates how to get the number of pages without consuming a lot of heap memory:

import com.aspose.pdf.Document;

public class PageCountExample {
    public static void main(String[] args) {
        // Load the PDF document
        Document document = new Document("C:\\foobar.pdf");
        
        // Get the number of pages
        int numberOfPages = document.getPages().size();
        
        // Close the document
        document.close();
        
        // Output the number of pages
        System.out.println("Number of pages: " + numberOfPages);
    }
}

However, to minimize memory usage further, you can use the ProcessParagraphs method, which allows you to get the page count without fully loading the document. Unfortunately, the specific implementation details for this method were not provided in the context, but you can refer to the Aspose documentation for more information on how to use it effectively.

If you are still facing issues with memory consumption, consider checking the release notes for any updates or optimizations that may have been introduced in later versions of Aspose.PDF.

For more detailed information, you can refer to the official documentation on page properties and handling large PDF files.

Sources:
[1]: Get and Set Page Properties|Aspose.PDF for .NET

aschuch · July 31, 2024, 3:03pm

@Professionalize.Discourse, thanks for the quick reply.

As you can see in my initial description, I already tried the solution in your code snippet and observed that it requires a lot of heap memory. It seems to me that

new Document("C:\\foobar.pdf");

loads large parts of the PDF file into memory.

Furthermore, I did some research regarding ProcessParagraphs: This method can be called on a Document object. However, as said before: Creating this Document object for an existing PDF file already consumes too much memory.

asad.ali · August 1, 2024, 1:25pm

@aschuch

We implemented a new constructor in the API (from version 21.2) to prevent situations where a lot of memory is required to read large PDF files. It accepts a byte array with the document data instead of a stream, so the API can work with the data faster.

It works effectively and requires much less memory than using an InputStream object.

Please try to use the following code:

FileInputStream input = new FileInputStream(file);
try {
    // readAllBytes() (Java 9+) reads the whole file; available() + read() may deliver fewer bytes
    byte[] bytes = input.readAllBytes();
    Document document = new Document(bytes);
    System.out.println(document.getPages().size());
    document.close();
} finally {
    input.close();
}

If you still notice any issues or face high memory consumption, please share your sample PDF document with us so that we can check it in our environment and address the issue accordingly. If the file is too large to attach, please upload it to Google Drive or Dropbox and share the link with us.

PS: We are working on adding the code example in public API documentation and it will be available soon. We will be attaching a ticket with this forum thread so that you would know once this example is available publicly.

aschuch · August 2, 2024, 2:25pm

Thank you for the code snippet.

Applying your code snippet to my example PDF file with 100,000 pages requires more than 160 MB of heap. Please find attached the example file:
100000pages.zip (8.4 MB)

(FYI: I evaluated some other libraries as well. For example pdfbox 3.0.2 only requires 64MB. However, I would prefer using aspose-pdf.)
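
The thread does not say how the heap figures above were obtained. For anyone who wants to reproduce comparable numbers, here is a rough sketch of one possible way to observe peak heap usage around a page-count call, using only the JDK's `java.lang.management` API. The class name `HeapPeakProbe`, its `measure` helper, and the `Result` holder are hypothetical names introduced for illustration; the stand-in task in `main` would be replaced by one of the page-count approaches from this thread. The reported value is the sum of the per-pool peaks, so it is an upper-bound approximation and not an exact requirement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.function.Supplier;

// Hypothetical helper (not part of any library): reports approximate peak heap while a task runs.
final class HeapPeakProbe {

    /** The task's return value together with the summed peak usage of all heap pools. */
    public static final class Result<T> {
        public final T value;
        public final long peakHeapBytes;

        Result(T value, long peakHeapBytes) {
            this.value = value;
            this.peakHeapBytes = peakHeapBytes;
        }
    }

    /** Runs the task and returns its value plus the approximate peak heap seen during the run. */
    public static <T> Result<T> measure(Supplier<T> task) {
        System.gc(); // only a hint; the JVM may ignore it
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                pool.resetPeakUsage(); // peak restarts from the current usage
            }
        }

        T value = task.get();

        long peak = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage usage = pool.getPeakUsage();
                if (usage != null) {
                    peak += usage.getUsed();
                }
            }
        }
        return new Result<>(value, peak);
    }

    public static void main(String[] args) {
        // Stand-in workload; swap in the page-count call you want to compare.
        Result<Integer> result = measure(() -> new byte[8 * 1024 * 1024].length);
        System.out.println("Task returned: " + result.value);
        System.out.println("Approx. peak heap (MB): " + result.peakHeapBytes / (1024 * 1024));
    }
}
```

Running the same task under a fixed `-Xmx` limit and lowering it until an `OutOfMemoryError` appears is another way to estimate the minimum heap a call needs; the two methods will not necessarily agree.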

asad.ali · August 2, 2024, 10:06pm

@aschuch

We need some more information about your environment before logging a ticket and addressing this issue. Can you please share which JDK version you are using?

aschuch · August 5, 2024, 10:45am

Sure, I’m using JDK 17.0.10.

asad.ali · August 5, 2024, 5:41pm

@aschuch

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44194

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.