The page tree nodes in a PDF should be written as a balanced tree instead of a flat list

While investigating the output of a PDF generation run, I examined the internal structure of the generated PDF and found that the page tree nodes were being written out as a single flat list of leaf nodes instead of as a balanced tree. While the page count is relatively low this matters little, but as the page count rises, pages take noticeably longer to access the further towards the end of the file they are. A flat list forces Adobe Reader (and other viewers) to start at the beginning of the page list and count off pages until the correct one is found; the higher the page number, the longer this takes.


When the nodes are placed in a tree, only the branch nodes that can contain the required page need to be read, much like a search on a database B-tree.
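To make the difference concrete, here is a rough sketch in Python of how a lookup walks the page tree. This is only a simplified model and not how Aspose.PDF or any particular viewer is implemented: the names `Page`, `Pages`, `build_balanced` and `find_page`, and the fan-out of 10, are all made up for illustration. The `count` field plays the role of the /Count entry that lets a reader skip a whole subtree without opening it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Page:
    """Leaf node: a single page (hypothetical model, not a real PDF object)."""
    label: str


@dataclass
class Pages:
    """Intermediate node; `count` mirrors the /Count entry of a page tree node."""
    kids: List[Union["Pages", Page]]
    count: int = field(init=False)

    def __post_init__(self) -> None:
        # Total number of leaf pages beneath this node.
        self.count = sum(k.count if isinstance(k, Pages) else 1 for k in self.kids)


def build_balanced(pages: List[Page], fanout: int = 10) -> Pages:
    """Group nodes bottom-up until the root has at most `fanout` kids."""
    level: List[Union[Pages, Page]] = list(pages)
    while len(level) > fanout:
        level = [Pages(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return Pages(level)


def find_page(root: Pages, index: int) -> Tuple[Page, int]:
    """Return the page at zero-based `index` and how many nodes were examined."""
    examined = 0
    node: Union[Pages, Page] = root
    while isinstance(node, Pages):
        for kid in node.kids:
            examined += 1
            size = kid.count if isinstance(kid, Pages) else 1
            if index < size:
                node = kid  # the wanted page is inside this kid: descend
                break
            index -= size   # skip the whole subtree using its count
        else:
            raise IndexError("page index out of range")
    return node, examined


leaves = [Page(str(i)) for i in range(1000)]
flat_root = Pages(leaves)                  # every page hangs directly off the root
balanced_root = build_balanced(leaves)     # three levels of at most 10 kids each

flat_page, flat_cost = find_page(flat_root, 999)
tree_page, tree_cost = find_page(balanced_root, 999)
```

In this model, reaching the last of 1000 pages examines 1000 nodes in the flat list but only 30 in the balanced tree. The exact numbers depend on the fan-out chosen, and a real viewer's cost also depends on how it resolves indirect objects.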

See sections “3.6.2 Page Tree” and “G.4 Page Tree Example” in the PDF file listed below:

Hi there,


Thanks for sharing your findings and suggestion. We have logged ticket PDFNEWNET-39116 in our issue tracking system to investigate the issue further. We will look into it and let you know our findings as soon as possible.

Best Regards,

Has there been any update to this? The unbalanced page tree causes huge performance issues in large PDFs that are being served over the network.

@johnhok

Sorry for the delayed response.

We regret to say that the earlier logged ticket has not been resolved yet. However, we have updated its information and will inform you as soon as it is fixed.

We apologize for the inconvenience.

@asad.ali thanks for your response! That’s unfortunate about the original ticket. I just wanted to add some colour to the issue for the Aspose team. Even the PDF Reference specification recommends keeping the page tree balanced in order to optimize overall application performance.

Screen Shot 2021-05-12 at 8.35.24 AM.png (32.0 KB)

I’ve seen firsthand how this issue hinders PDF performance, which is why I wanted to surface it again after the OP. I appreciate your help in updating the ticket with more information. I’ll keep an eye on this thread for any updates.

@johnhok

Thanks for sharing your feedback.

The ticket information has been updated with your concerns, and we will update you once it is fixed. We apologize for the inconvenience this has caused. Please give us some time.

The issues you reported earlier (filed as PDFNET-39116) have been fixed in Aspose.PDF for .NET 22.11. The Document.PageNodesToBalancedTree method was added.