
Free Support Forum - aspose.com

High memory usage on PDF document object

Hello. We are using Aspose.PDF in our product to search for certain patterns with a regex.
My IDE is reporting large memory allocations for the PDF document object.
Here is a code example that shows it:

using var pdfDocument = new Pdf.Document(inputStream);

// Combine the search terms into a single alternation pattern, e.g. ((a)|(b))|(c)
var patterns = input.Texts.Aggregate((acc, next) => $"({acc})|({next})");

var textFragmentAbsorber =
    new TextFragmentAbsorber(new Regex(patterns, RegexOptions.IgnoreCase));

pdfDocument.Pages.Accept(textFragmentAbsorber);

// ...process result
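For reference, here is a self-contained sketch of what the Aggregate call builds (the term list is hypothetical, standing in for input.Texts):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class PatternDemo
{
    static void Main()
    {
        // Hypothetical search terms standing in for input.Texts
        var texts = new[] { "invoice", "total", "vat" };

        // Each term is wrapped in its own capture group and joined with |
        var patterns = texts.Aggregate((acc, next) => $"({acc})" + "|" + $"({next})");
        Console.WriteLine(patterns); // ((invoice)|(total))|(vat)

        var regex = new Regex(patterns, RegexOptions.IgnoreCase);
        Console.WriteLine(regex.IsMatch("Total: 100 EUR")); // True
    }
}
```

If the terms can contain regex metacharacters, wrapping each one in Regex.Escape before aggregating avoids accidental pattern syntax.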

The IDE shows that 18,425 MB are allocated on the line pdfDocument.Pages.Accept(textFragmentAbsorber);.

The test document is a 140-page document with a file size of 1.6 MB.

What can we do about this? I think it may lead to problems once a high number of users are using the feature. We are in development and going to production soon, and we don't want to run into memory issues.

@grinaypps

The memory usage depends on the document size as well as the complexity of its structure. It is quite possible for a small PDF to have a complex structure with so many elements on a single page that large memory allocations are required during its processing. Make sure to debug in x64 mode when working with large and complex documents.

Also, you can perform the text absorption at page level instead of doing it for the whole document at once. This way memory consumption stays lower and the code produces results faster. For example:

foreach (Page page in pdfDocument.Pages)
{
    var textFragmentAbsorber = new TextFragmentAbsorber(new Regex(patterns, RegexOptions.IgnoreCase));
    page.Accept(textFragmentAbsorber);
    // process the results for this page
}

@asad.ali We are still experiencing the memory allocation issue; I would say it is a memory leak.
Our current usage of the library is:

var inputStream = new MemoryStream(pdfFile.FileBytes);
var pdfDocument = new Pdf.Document(inputStream);

var textFragmentAbsorber =
    new TextFragmentAbsorber(new Regex(term.Pattern, RegexOptions.IgnoreCase));
foreach (var page in pdfDocument.Pages)
{
    page.Accept(textFragmentAbsorber);
    // extract result
}
pdfDocument.FreeMemory();
pdfDocument.Dispose();


await inputStream.DisposeAsync();

// GC.Collect(0, GCCollectionMode.Forced);
GC.Collect();
GC.WaitForPendingFinalizers();

We ran a load test with 50 threads; within 5 minutes memory usage reached around 20 GB.
After the load test stops, the memory is never released.
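One thing worth checking when memory stays high even after everything is disposed (an assumption on my side, not a confirmed diagnosis) is large object heap fragmentation: buffers over 85,000 bytes, which page processing can easily produce, land on the LOH, and by default the .NET GC does not compact it. You can request a one-time compaction before the forced collection:

```csharp
using System;
using System.Runtime;

class LohCompaction
{
    static void Main()
    {
        // Request that the next blocking collection also compact the
        // large object heap; the setting reverts to Default afterwards.
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Console.WriteLine(GCSettings.LargeObjectHeapCompactionMode); // Default
    }
}
```

Note that even then, working-set numbers from the OS can stay high because the runtime may keep freed segments reserved; comparing GC.GetTotalMemory(true) before and after the run gives a better picture of managed memory than Task Manager.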

DPA shows that the problem lies somewhere in TextFragmentAbsorber.
Screenshot at Jun 04 19-02-32.jpg (869.6 KB)

This issue is really blocking us from moving to production. How can we solve it?

@grinaypps

Would you please make sure to test the case using version 22.5 of the API? If the issue still persists, please share a sample console application along with the sample PDF file so that we can run it in our environment and reproduce the issue. We will then proceed to assist you accordingly.