Traversing pages in PDF caused OOM error

rye3000 · May 19, 2021, 2:51am

We are using the Aspose PDF Java library, version 21.3, to traverse a document and all of its pages and store the paragraph text. We allocated 2 GB of memory to the process; the document is about 8 MB, with a bit over 3,000 pages. It seems we have to allocate 4 GB of memory for the process to complete, and even then it takes over an hour.

The process ran for a while and then quickly ended with an out-of-memory error. From what I saw while profiling, it looks like the PDF document maintains a list of generic objects, and that is probably the root cause of the OOM, even though we call page.freeMemory().

The following code snippet caused the OOM:

    Document doc = new Document(documentPath);
    int limit = doc.getPages().size();
    int id = 0;

    for (int page = 1; page <= limit; page++) {
        ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
        Page pageObject = doc.getPages().get_Item(page);
        paragraphAbsorber.visit(pageObject);

        for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
            for (MarkupSection section : markup.getSections()) {
                for (MarkupParagraph paragraph : section.getParagraphs()) {
                    String text = paragraph.getText();
                    repository.save(new Extract(documentId, page, text, id));
                    id++;
                }
            }
        }
        pageObject.freeMemory();

    }

Is there something we did wrong? Needing 4 GB of memory for an 8 MB PDF does not scale well to multiple documents.
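One workaround that is sometimes tried for this kind of per-document memory growth is to process the pages in batches and reopen the document between batches, so that whatever the document object caches can be released. Nothing in this thread confirms that this helps with Aspose PDF, so treat the following as a sketch only. The `PageBatches` class, its `Range` type, and the batch size of 500 are hypothetical names and values chosen for illustration; the code only computes the 1-based inclusive page ranges and does not call the PDF library.

```java
import java.util.ArrayList;
import java.util.List;

public class PageBatches {

    /** Inclusive, 1-based page range (hypothetical helper type). */
    public static final class Range {
        public final int first;
        public final int last;

        Range(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    /** Splits pages 1..pageCount into consecutive ranges of at most batchSize pages. */
    public static List<Range> split(int pageCount, int batchSize) {
        if (pageCount < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("pageCount must be >= 0 and batchSize > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (int first = 1; first <= pageCount; first += batchSize) {
            // The last batch may be shorter than batchSize.
            int last = Math.min(first + batchSize - 1, pageCount);
            ranges.add(new Range(first, last));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed values: roughly the page count from the post, arbitrary batch size.
        for (Range r : split(3000, 500)) {
            System.out.println(r);
        }
    }
}
```

For each range, the page loop from the snippet above would run from `first` to `last` on a freshly opened `new Document(documentPath)`, and the document would be released before the next range is started. Whether this lowers peak memory in practice would have to be measured.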

mudassir.fayyaz · May 19, 2021, 11:46am

@rye3000

Could you please share the source PDF file so that we can try to reproduce the issue on our end?

rye3000 · May 19, 2021, 8:33pm

Listings_TamiFlu_966_4142.pdf (7.8 MB)

mudassir.fayyaz · May 19, 2021, 9:19pm

@rye3000

Your code does not include a definition for the line below, but I have noticed high resource usage when processing your PDF file. A ticket with ID PDFJAVA-40503 has been created in our issue tracking system to investigate the issue further on our end. This thread has been linked to the ticket so that you will be notified once the issue is fixed.

    repository.save(new Extract(documentId, page, text, id));

rye3000 · May 20, 2021, 1:27pm

That line is a call to the DB. You can simply replace it with a println() statement. You can even remove the following loops and still reproduce the issue:

    for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
        for (MarkupSection section : markup.getSections()) {
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                String text = paragraph.getText();
                repository.save(new Extract(documentId, page, text, id));
                id++;
            }
        }
    }
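
When reproducing the issue without the DB call, it can help to print heap usage per page so the growth is visible before the process fails. The following is a minimal sketch that uses only the standard `Runtime` API; the `HeapProbe` class and its method names are hypothetical, and the loop in `main` stands in for the real page loop rather than calling the PDF library.

```java
import java.util.Locale;

public class HeapProbe {

    /** Approximate heap currently in use by the JVM, in bytes. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats one log line for a page; megabytes are rounded down. */
    public static String format(int page, long usedBytes) {
        return String.format(Locale.ROOT, "page %d: %d MB used", page, usedBytes / (1024 * 1024));
    }

    public static void main(String[] args) {
        // Placeholder loop: in the real reproduction this would be the
        // per-page loop, with the absorber call in place of this comment.
        for (int page = 1; page <= 3; page++) {
            System.out.println(format(page, usedHeapBytes()));
        }
    }
}
```

The figure reported this way includes garbage that has not been collected yet, so it is only a rough trend indicator; a heap dump or a profiler gives a more reliable picture of what is being retained.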

mudassir.fayyaz · May 20, 2021, 11:02pm

@rye3000

Thanks for the information. We have recorded your feedback under the same ticket for our reference.

gwbradley · June 9, 2021, 1:11am

Is there any update on this ticket?

mudassir.fayyaz · June 9, 2021, 11:06am

@gwbradley

Please note that the ticket was recently logged under the free support model and will be investigated and resolved on a first-come, first-served basis. We will inform you as soon as we make definite progress towards its resolution. Please be patient and allow us some time.