How to limit memory usage when extracting text from large PDFs?

We are using Aspose.Pdf 8.0 and seeing very high memory usage when attempting to extract the text from large PDF files (thousands of pages).

We need to extract the text in a way that lets us identify which page each piece of text came from, so we cannot run a single TextAbsorber over the entire Pages collection. But if we loop through each page with a new TextAbsorber on the same Document object, memory usage behaves like a leak. If we instead create a new Document object per page, or per small group of pages, we can keep memory manageable, but performance obviously suffers from the overhead of re-creating the Document object.

Here is some test code showing the issue. Run as a 32-bit app it throws OutOfMemoryException once the document is large enough; run as 64-bit it can consume gigabytes of memory. There appears to be no way to release the memory allocated when calling Accept() or Visit() with a TextAbsorber on a document. TextAbsorber does not implement IDisposable, so the memory appears to be rooted in the Document object.
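
For reference, here is a minimal sketch of the pattern that reproduces the problem (the full test code is in the attachment). The file name "large.pdf" is a placeholder:

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class Repro
{
    static void Main()
    {
        // One Document for the whole file; a fresh TextAbsorber per page
        // so the extracted text can be attributed to its page number.
        Document doc = new Document("large.pdf"); // placeholder input file
        for (int i = 1; i <= doc.Pages.Count; i++) // Pages is 1-based
        {
            TextAbsorber absorber = new TextAbsorber();
            doc.Pages[i].Accept(absorber);   // parse page i and absorb its text
            string pageText = absorber.Text; // text for page i only
            Console.WriteLine("Page {0}: {1} chars", i, pageText.Length);
            // The absorber goes out of scope here, but the parsed page data
            // stays cached on doc, so the working set grows with every page.
        }
    }
}
```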

Is this a bug that can be fixed, or is there another way to extract text from each page of a large PDF without the memory growth, short of re-creating the Document object for each small batch of pages? Example code and PDF are attached.
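
For completeness, here is a sketch of the batching workaround described above, assuming Document is disposable in this version. The batch size of 50 is an illustrative tuning value, and "large.pdf" is again a placeholder:

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class BatchedExtract
{
    const int BatchSize = 50; // assumption: tune for memory vs. re-open cost

    static void Main()
    {
        int totalPages;
        using (Document probe = new Document("large.pdf"))
            totalPages = probe.Pages.Count;

        for (int start = 1; start <= totalPages; start += BatchSize)
        {
            // Re-create the Document per batch so everything rooted on the
            // previous instance becomes collectible once it is disposed.
            using (Document doc = new Document("large.pdf"))
            {
                int end = Math.Min(start + BatchSize - 1, totalPages);
                for (int i = start; i <= end; i++)
                {
                    TextAbsorber absorber = new TextAbsorber();
                    doc.Pages[i].Accept(absorber);
                    Console.WriteLine("Page {0}: {1} chars", i, absorber.Text.Length);
                }
            } // Dispose releases the per-page caches held by this instance
        }
    }
}
```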

Hi there,

Thanks for your inquiry. We have logged this issue as PDFNEWNET-35329 in our issue tracking system for further investigation and resolution. We will keep you updated on its progress via this forum thread.

Sorry for the inconvenience.

Best Regards,