Aspose.PDF for Java causes OutOfMemoryError while traversing pages

We’ve had a production outage for a week straight, caused by what appears to be
unbounded resource consumption in Page.getPageRect(true) when it is handed a PDF with
a malformed page tree. Every time the affected document was processed, the JVM heap
ballooned to 64 GB and the Tomcat instance died with “GC overhead limit exceeded”.
I’m posting here in the hope that someone can confirm whether this is a known issue and
whether a release newer than 24.1 handles it safely.

Environment

  • Aspose.PDF for Java: 24.1
  • JDK: Oracle JDK 1.8.0_341
  • Server: Apache Tomcat 9.0.107 on Windows Server 2022
  • Heap: -Xms16g -Xmx64g, G1GC

A heap dump taken just before OOM showed ~560 million live instances of
com.aspose.pdf.internal.l2h.l1v / l2n.l1v retaining ~60 GB. The GC-root path led
straight to the getPageRect/getRect_Rename_Namesake frame in the stack trace. The mutual recursion
l2n.l1if.lt ↔ l2n.l1if.lf in the trace strongly suggests an unterminated parent-chain
walk while resolving inherited page attributes.

Here is the Java code that triggers it:

byte[] pdfBytes = Helpers.getFileBytes(pdfPath);
try (InputStream pdfInputStream = new ByteArrayInputStream(pdfBytes);
     Document doc = new Document(pdfInputStream, password)) {

  for (int i = 0; i < signatures.length(); i++) {
      JSONObject signature = signatures.getJSONObject(i);

      Page page = doc.getPages().get_Item(signature.getInt("pageNumber"));
      Rectangle pageRectangle = page.getPageRect(true);   // <-- OOM here
      double pageWidth  = pageRectangle.getWidth();
      double pageHeight = pageRectangle.getHeight();
      // ... compute signature placement ...
  }

}

Works on millions of PDFs. Dies on the one below.

The malformed PDF

By dumping the byte buffers out of the heap we recovered the offending file. Its page
tree is the root cause:

2 0 obj
<<
/Count 2
/Kids [4 0 R 5 0 R]
/Type /Pages
>>
endobj

4 0 obj
<<
/Type /Page
/MediaBox [0 0 612 792]
/Parent 4 0 R ← page 4 declares ITSELF as its parent
/Contents […]
>>
endobj

5 0 obj
<<
/Type /Page
/MediaBox [0 0 612 792]
/Parent 4 0 R ← page 5’s parent points at a /Page, not the /Pages tree
/Contents […]
>>
endobj

Both pages have a non-conformant /Parent:

  • Page 4 is its own parent — a direct self-loop.
  • Page 5 points to page 4 instead of to the /Pages dictionary (object 2).

Per the PDF spec, MediaBox / CropBox / Rotate are inherited from /Pages ancestors, so
getPageRect(true) has to walk the parent chain. On this document the walk is
non-terminating (or terminates only after generating enormous intermediate state,
given the hundreds of millions of l2n.l1v instances retained).
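
To make the failure mode concrete, here is a minimal sketch of an inherited-attribute lookup that walks /Parent with a visited set and fails fast on a bad link. Everything in it (GuardedParentWalk, Node, resolve) is a hypothetical toy model, not Aspose’s internal implementation. It also assumes the walk is triggered by an inheritable attribute that is absent on the page itself, such as /Rotate, since both pages in the dump carry their own /MediaBox.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy page-tree model (hypothetical, not Aspose's internals) showing a guarded /Parent walk. */
final class GuardedParentWalk {

    /** One page-tree node; parent is the object number of /Parent, or null at the root. */
    static final class Node {
        final String type;                                   // "Page" or "Pages"
        final Integer parent;
        final Map<String, Object> attrs = new HashMap<>();
        Node(String type, Integer parent) { this.type = type; this.parent = parent; }
        Node with(String key, Object value) { attrs.put(key, value); return this; }
    }

    /** Resolves an inheritable attribute, failing fast instead of looping on a bad /Parent. */
    static Object resolve(Map<Integer, Node> objects, int pageObj, String key) {
        Set<Integer> visited = new HashSet<>();
        Integer current = pageObj;
        while (current != null) {
            if (!visited.add(current)) {                     // already seen: the chain loops
                throw new IllegalStateException("Cyclic /Parent chain at object " + current);
            }
            Node node = objects.get(current);
            if (node == null) {
                throw new IllegalStateException("Dangling /Parent reference to object " + current);
            }
            if (current != pageObj && !"Pages".equals(node.type)) {
                throw new IllegalStateException(
                        "/Parent " + current + " is a /" + node.type + ", not /Pages");
            }
            if (node.attrs.containsKey(key)) {
                return node.attrs.get(key);                  // nearest definition wins
            }
            current = node.parent;
        }
        return null;                                         // reached the root without a value
    }

    /** The malformed tree from this thread: page 4 is its own parent, page 5 points at page 4. */
    static Map<Integer, Node> malformedTree() {
        Map<Integer, Node> objects = new HashMap<>();
        objects.put(2, new Node("Pages", null));
        objects.put(4, new Node("Page", 4).with("MediaBox", "[0 0 612 792]"));
        objects.put(5, new Node("Page", 4).with("MediaBox", "[0 0 612 792]"));
        return objects;
    }

    public static void main(String[] args) {
        Map<Integer, Node> objects = malformedTree();
        System.out.println(resolve(objects, 4, "MediaBox")); // defined on the page, no walk needed
        try {
            resolve(objects, 4, "Rotate");                   // absent on the page, so the walk starts
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With a guard like this the lookup costs at most one step per distinct node, so a cycle is reported after a single revisit instead of consuming heap.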

The PDF was produced by PDFium (Chrome / Edge “Save as PDF”) and then annotated/signed
in Apple Preview / iOS Markup. It opens and renders fine in Adobe Reader, Preview, and
Chrome, so it looks valid to end users, but it kills Aspose’s geometry resolver.

Questions for Aspose support

  1. Is Page.getPageRect(true) expected to terminate cleanly when the page tree contains
    a /Parent self-loop or non-/Pages parent?
  2. Is this fixed in a newer release?
  3. Is there an officially recommended way to sanitize / repair a page tree before
    passing the Document to Aspose APIs? Or a flag on LoadOptions that forces tolerant
    parsing?
  4. Short of an upgrade, is there a public API to detect a self-referential or
    non-/Pages /Parent before calling getPageRect?

We were able to stop the OOM, but I’d strongly prefer that Aspose fail fast
(e.g. throw InvalidPdfFileFormatException) on a cyclic page tree rather than try to
resolve it.

Thanks. This brought production down repeatedly for over four months before we identified
the parent cycle as the trigger, so any guidance on the correct long-term fix is
hugely appreciated.

@Eslam_Hawas

Hello,
Thank you for the exceptionally detailed report. The heap dump, the recovered file, and the parent-cycle analysis made this straightforward to confirm. Your diagnosis is correct.

Confirmed. getPageRect(true) resolves the inherited page attributes (MediaBox/CropBox/Rotate, which the PDF spec inherits from /Pages ancestors) by walking the page’s /Parent chain. That walk is implemented as a mutual recursion over the page-tree nodes and has no cycle detection. On a normal acyclic tree it terminates at the document /Pages root; on your file it doesn’t:

  • Page 4’s /Parent points at itself — a direct self-loop.
  • Page 5’s /Parent points at page 4 (a /Page, not the /Pages dictionary).

So the resolver never reaches a root and keeps allocating per-node state, which is the flood of internal instances you saw retaining ~60 GB until the heap is exhausted. This matches your l1if.lt ↔ l1if.lf recursion and the retained node objects exactly.

This is not specific to 24.1. We checked the current release and the inherited-attribute resolution still walks /Parent the same way, with no guard against a cyclic or non-/Pages parent, so upgrading alone will not resolve this particular case. We agree with your preferred behavior: the library should fail fast (e.g. InvalidPdfFileFormatException) on a cyclic page tree rather than attempt to resolve it. We’re logging this as a defect on that basis; we’ll follow up here with the issue ID.

On your specific questions:

  1. Should getPageRect(true) terminate on a self-loop / non-/Pages parent? It should, and today it does not — this is the bug.
  2. Fixed in a newer release? Not yet; the current build behaves the same. We’re filing it.
  3. Tolerant-parsing / repair option? There’s no LoadOptions flag that addresses a cyclic /Parent (the WarningCallback won’t fire for this). Document.repair() may normalize a
    malformed page tree — it’s worth testing against your recovered file before calling getPageRect, though we’d want to validate it rebuilds the parent links on a sample like
    yours.
  4. Public API to pre-detect a self-referential /Parent? There isn’t a clean high-level accessor — Page doesn’t expose its /Parent. Until the fail-fast fix lands, a
    pre-validation pass over the page tree with a separate parser (rejecting any page whose /Parent is itself or is not the /Pages dictionary) is the most reliable guard, which
    aligns with the mitigation you’ve already deployed.
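
As a rough illustration of the pre-validation pass in point 4, the sketch below scans raw PDF text with regular expressions and reports pages whose /Parent is missing, is the page itself, is not a /Pages node, or leads into a cycle. This is a deliberately simplified stand-in for a real parser: the class and method names are hypothetical, and it only sees objects stored as plain uncompressed dictionaries. Objects inside compressed object streams are invisible to it, and bytes inside content streams could in principle produce false matches, so treat it as a starting point rather than a complete guard.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based precheck; NOT a full PDF parser (uncompressed objects only). */
final class PageParentPrecheck {
    private static final Pattern OBJECT =
            Pattern.compile("(\\d+)\\s+\\d+\\s+obj\\b(.*?)endobj", Pattern.DOTALL);
    private static final Pattern TYPE =
            Pattern.compile("/Type\\s*/(Pages?)(?![A-Za-z0-9])");
    private static final Pattern PARENT =
            Pattern.compile("/Parent\\s+(\\d+)\\s+\\d+\\s+R\\b");

    /** Returns one message per suspicious page; an empty list means nothing was flagged. */
    static List<String> findProblems(String pdfText) {
        Map<Integer, String> types = new TreeMap<>();    // object number -> "Page" or "Pages"
        Map<Integer, Integer> parents = new HashMap<>(); // object number -> /Parent object number
        Matcher obj = OBJECT.matcher(pdfText);
        while (obj.find()) {
            String body = obj.group(2);
            Matcher type = TYPE.matcher(body);
            if (!type.find()) {
                continue;                                // not a page-tree node
            }
            int number = Integer.parseInt(obj.group(1));
            types.put(number, type.group(1));
            Matcher parent = PARENT.matcher(body);
            if (parent.find()) {
                parents.put(number, Integer.parseInt(parent.group(1)));
            }
        }
        List<String> problems = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : types.entrySet()) {
            if (!"Page".equals(entry.getValue())) {
                continue;
            }
            int page = entry.getKey();
            Integer parent = parents.get(page);
            if (parent == null) {
                problems.add("Page object " + page + " has no /Parent");
            } else if (parent == page) {
                problems.add("Page object " + page + " is its own /Parent");
            } else if (!"Pages".equals(types.get(parent))) {
                problems.add("Page object " + page + " has /Parent " + parent + ", which is not /Pages");
            } else if (hasCycle(parent, parents)) {
                problems.add("Page object " + page + " has a cyclic /Parent chain");
            }
        }
        return problems;
    }

    /** Follows /Parent links upward and reports whether any node is visited twice. */
    private static boolean hasCycle(int start, Map<Integer, Integer> parents) {
        Set<Integer> seen = new HashSet<>();
        Integer current = start;
        while (current != null) {
            if (!seen.add(current)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }
}
```

A caller could decode the file with new String(bytes, StandardCharsets.ISO_8859_1), which maps every byte to exactly one character, and skip or quarantine the document when findProblems returns a non-empty list.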

If you can attach the recovered sample to this thread, we’ll include it directly in the defect report so the fix is verified against the exact structure that hit you.

Apologies that this cost you repeated production impact — the fail-fast behavior you’re asking for is the right long-term fix and we’ll push for it.

Hi @ilyazhuykov,

Thank you for the prompt reply and for confirming the defect. I’m glad the detailed analysis was helpful in pinpointing the root cause.

Unfortunately, I cannot attach the recovered sample file, as it is a highly confidential document.

However, since the exact page tree structure causing the infinite loop is outlined in my original post (specifically Page 4 acting as its own parent, and Page 5 pointing to Page 4 instead of the /Pages dictionary), I believe your engineering team should be able to construct a minimal synthetic PDF with this exact malformation to test and verify the fix.
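
For reference, a minimal sketch of how such a synthetic file could be generated with plain Java and no PDF library. The class name, the catalog in object 1, and the empty placeholder in object 3 are assumptions added only so that the pages keep object numbers 4 and 5 from the original post. /Contents is left out because the real values are not shown in this thread. Whether this exact file reproduces the OOM has not been verified.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical generator for a tiny PDF whose pages carry the malformed /Parent links. */
final class CyclicParentPdf {

    static byte[] build() {
        String[] objects = new String[6];                // index = object number, 0 is unused
        objects[1] = "<< /Type /Catalog /Pages 2 0 R >>";
        objects[2] = "<< /Type /Pages /Count 2 /Kids [4 0 R 5 0 R] >>";
        objects[3] = "<< >>";                            // unreferenced placeholder (assumption)
        objects[4] = "<< /Type /Page /MediaBox [0 0 612 792] /Parent 4 0 R >>"; // self-loop
        objects[5] = "<< /Type /Page /MediaBox [0 0 612 792] /Parent 4 0 R >>"; // parent is a /Page

        StringBuilder pdf = new StringBuilder("%PDF-1.4\n");
        int[] offsets = new int[objects.length];
        for (int n = 1; n < objects.length; n++) {
            offsets[n] = pdf.length();                   // ASCII only, so chars equal bytes
            pdf.append(n).append(" 0 obj\n").append(objects[n]).append("\nendobj\n");
        }

        int xrefOffset = pdf.length();
        pdf.append("xref\n0 ").append(objects.length).append('\n');
        pdf.append("0000000000 65535 f \n");             // object 0: head of the free list
        for (int n = 1; n < objects.length; n++) {
            // Each cross-reference entry must be exactly 20 bytes long.
            pdf.append(String.format(Locale.ROOT, "%010d 00000 n \n", offsets[n]));
        }
        pdf.append("trailer\n<< /Size ").append(objects.length).append(" /Root 1 0 R >>\n");
        pdf.append("startxref\n").append(xrefOffset).append("\n%%EOF\n");
        return pdf.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        String target = args.length > 0 ? args[0] : "cyclic-parent.pdf";
        Files.write(Paths.get(target), build());
    }
}
```

If this file is opened with an affected Aspose version, it would be safer to do so in a throwaway JVM with a small -Xmx, since the point of the sample is to provoke the unbounded walk.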

If your team needs any other specific metadata or structural details from the file, please let me know and I can extract them manually without sharing the document itself.

Thank you again for your support and for prioritizing the fail-fast behavior.