We are using the Aspose PDF for Java library (version 21.3) to traverse a document and all of its pages and store the paragraph text. We allocated 2 GB of memory for the process; the document is about 8 MB with a bit over 3,000 pages. It seems we have to allocate 4 GB for the process to complete, and even then it takes over an hour.
With 2 GB, the process ran for a while and then ended with an OutOfMemoryError. From profiling, it looks like the PDF document maintains a list of generic objects internally, and that is probably the root cause of the OOM, even though we call page.freeMemory() after each page.
The following is the code snippet that caused the OOM:
Document doc = new Document(documentPath);
int limit = doc.getPages().size();
int id = 0;
for (int page = 1; page <= limit; page++) {
    ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
    Page pageObject = doc.getPages().get_Item(page);
    paragraphAbsorber.visit(pageObject);
    for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
        for (MarkupSection section : markup.getSections()) {
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                String text = paragraph.getText();
                repository.save(new Extract(documentId, page, text, id));
                id++;
            }
        }
    }
    pageObject.freeMemory();
}
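One workaround we are considering is to process the document in fixed-size batches, closing and reopening the Document between batches so that its internal object cache can be garbage-collected. This is only a sketch: whether closing and reopening actually releases the cached object list is an assumption on our part, and the Aspose calls appear only as comments. The runnable part just computes the 1-based inclusive page ranges per batch:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchRanges {

    // Compute 1-based inclusive [start, end] page ranges of at most
    // batchSize pages each, covering pages 1..totalPages.
    static List<int[]> batchRanges(int totalPages, int batchSize) {
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical per-batch processing (sketch, not verified):
        //   Document doc = new Document(documentPath);
        //   for (int page = range[0]; page <= range[1]; page++) {
        //       ... run the ParagraphAbsorber on doc.getPages().get_Item(page) ...
        //   }
        //   doc.close();  // hope: drops the cached object list before the next batch
        for (int[] range : batchRanges(3005, 500)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

Reopening the document per batch trades repeated parsing time for a bounded heap, which may be acceptable if the OOM is indeed caused by the per-document object cache growing with every page visited.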
Is there something we did wrong? Needing 4 GB of memory for an 8 MB PDF does not scale well when we process multiple documents.