
Free Support Forum - aspose.com

High memory usage on PDF document object

Hello. We are using Aspose.PDF in our product to search for certain patterns with a regex.
My IDE is reporting large memory allocations for the PDF document object.
Here is a code example that shows it:

using var pdfDocument = new Pdf.Document(inputStream);

// Combine the search terms into a single alternation pattern, e.g. ((a)|(b))|(c)
var patterns = input.Texts.Aggregate((acc, next) => $"({acc})|({next})");

var textFragmentAbsorber =
    new TextFragmentAbsorber(new Regex(patterns, RegexOptions.IgnoreCase));

pdfDocument.Pages.Accept(textFragmentAbsorber);

// ...process result
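For reference, here is a self-contained sketch of what the Aggregate call builds (the term list is hypothetical, standing in for input.Texts):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class PatternDemo
{
    static void Main()
    {
        // Hypothetical search terms standing in for input.Texts
        var texts = new[] { "invoice", "total", "vat" };

        // Each term is wrapped in its own capture group and joined with |
        var patterns = texts.Aggregate((acc, next) => $"({acc})" + "|" + $"({next})");
        Console.WriteLine(patterns); // ((invoice)|(total))|(vat)

        var regex = new Regex(patterns, RegexOptions.IgnoreCase);
        Console.WriteLine(regex.IsMatch("Total: 100 EUR")); // True
    }
}
```

If the terms can contain regex metacharacters, wrapping each one in Regex.Escape before aggregating avoids accidental pattern syntax.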

The IDE shows that 18,425 MB are allocated on the line pdfDocument.Pages.Accept(textFragmentAbsorber);.

The test document is a 140-page document with a file size of 1.6 MB.

What can we do about this? I think it may lead to problems once a high number of users are using the feature. We are in development and going to production soon, and we don't want to run into memory issues.

@grinaypps

The memory usage depends on the document size as well as the complexity of its structure. It is quite possible for a small PDF to have a complex structure with so many elements on a single page that large memory allocations are required during its processing. Make sure to debug in x64 mode when working with large and complex documents.

Also, you can perform the text absorption at page level instead of doing it for the whole document at once. This way memory consumption stays lower and the code produces results faster. For example:

foreach (Page page in pdfDocument.Pages)
{
    var textFragmentAbsorber = new TextFragmentAbsorber(new Regex(patterns, RegexOptions.IgnoreCase));
    page.Accept(textFragmentAbsorber);
    // process the results for this page
}

@asad.ali We are still experiencing the memory allocation issue; I would say it is a memory leak.
Our current usage of the library is:

var inputStream = new MemoryStream(pdfFile.FileBytes);
var pdfDocument = new Pdf.Document(inputStream);

var textFragmentAbsorber =
    new TextFragmentAbsorber(new Regex(term.Pattern, RegexOptions.IgnoreCase));
foreach (var page in pdfDocument.Pages)
{
    page.Accept(textFragmentAbsorber);
    // extract result
}
pdfDocument.FreeMemory();
pdfDocument.Dispose();


await inputStream.DisposeAsync();

// GC.Collect(0, GCCollectionMode.Forced);
GC.Collect();
GC.WaitForPendingFinalizers();

We ran a load test with 50 threads; within 5 minutes memory usage reached around 20 GB.
After the load test stops, the memory is never released.
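One thing worth checking when memory stays high even after everything is disposed (an assumption on my side, not a confirmed diagnosis) is large object heap fragmentation: buffers over 85,000 bytes, which page processing can easily produce, land on the LOH, and by default the .NET GC does not compact it. You can request a one-time compaction before the forced collection:

```csharp
using System;
using System.Runtime;

class LohCompaction
{
    static void Main()
    {
        // Request that the next blocking collection also compact the
        // large object heap; the setting reverts to Default afterwards.
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Console.WriteLine(GCSettings.LargeObjectHeapCompactionMode); // Default
    }
}
```

Note that even then, working-set numbers from the OS can stay high because the runtime may keep freed segments reserved; comparing GC.GetTotalMemory(true) before and after the run gives a better picture of managed memory than Task Manager.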

DPA shows that the problem lies somewhere in TextFragmentAbsorber.
Screenshot at Jun 04 19-02-32.jpg (869.6 KB)

This issue is really blocking us from moving to production. How can we solve it?

@grinaypps

Would you please make sure to test the case using version 22.5 of the API? If the issue still persists, please share a sample console application along with the sample PDF file so that we can run it in our environment and reproduce the issue. We will then proceed to assist you accordingly.