High memory usage when extracting text from PDF

Hi

There appear to be some memory management issues within the PDF product when extracting text from PDFs.

The code we were using was from your samples:

var textAbsorber = new TextAbsorber(
    new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.Visit(document);

In our tests we were able to consume all of a test server's memory (24 GB) in roughly 7 minutes (multi-threaded code against many copies of the attached sample document).
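For context, the multi-threaded harness is essentially a parallel loop of the following shape (an illustrative sketch rather than our exact code - the path, thread count and downstream handling are placeholders):

using System.IO;
using System.Threading.Tasks;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Illustrative only: many copies of the sample PDF, processed in parallel.
var files = Directory.GetFiles(@"C:\temp\pdf-copies", "*.pdf"); // placeholder path

Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 8 }, file =>
{
    var document = new Document(file);

    // Same whole-document extraction as the sample above.
    var textAbsorber = new TextAbsorber(
        new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.Visit(document);

    var text = textAbsorber.Text; // consumed by the rest of our pipeline
});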

We then adjusted our code to the following:

var sb = new StringBuilder();
var i = 0;
foreach (var page in document.Pages)
{
    // Extract text page by page rather than for the whole document at once.
    var textAbsorber = new TextAbsorber(
        new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.Visit(page);
    sb.Append(textAbsorber.Text);

    // Force a collection every 100 pages to keep the working set bounded.
    if (i++ % 100 == 0)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

Using the new method you could see the memory being consumed - but then being released.
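One further variation (not something we have verified - it is an assumption on my part that Aspose.Pdf.Document implements IDisposable): wrapping each document in a using block so per-document resources are released deterministically rather than relying on forced collections, along these lines:

using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var sb = new StringBuilder();

// Placeholder path for one of the copies in the harness.
using (var document = new Document(@"C:\temp\pdf-copies\sample.pdf")) // assumes Document implements IDisposable
{
    foreach (var page in document.Pages)
    {
        var textAbsorber = new TextAbsorber(
            new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
        textAbsorber.Visit(page);
        sb.Append(textAbsorber.Text);
    }
}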

I noticed a number of other articles regarding this on the blog, but couldn't see any solutions (like the above), and I also saw some responses suggesting that this had been resolved (which I don't believe it has).

Thanks

Hard to tell whether the example file uploaded successfully - but just in case, it's easy to reproduce: simply copy Lorem Ipsum text into Word until you have ~3500 pages and save as PDF.
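If going via Word is awkward, something along these lines should produce a roughly equivalent test file with Aspose.PDF itself (a sketch only - the page count and filler text are arbitrary):

using Aspose.Pdf;
using Aspose.Pdf.Text;

var doc = new Document();
const string filler = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";

// Roughly 3500 pages of filler text; the exact layout does not matter for the test.
for (var p = 0; p < 3500; p++)
{
    var page = doc.Pages.Add();
    for (var n = 0; n < 10; n++)
    {
        page.Paragraphs.Add(new TextFragment(filler));
    }
}

doc.Save("Lorem ipsum dolor sit amet.pdf");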

@michaelpaye

Thank you for contacting support.

Please note that only smaller files can be uploaded to the forums; if your file is large, please share it by uploading to Google Drive, Dropbox, etc. so that we may try to reproduce and investigate the issue in our environment.

Hi Farhan - as mentioned, any large PDF will do; Lorem Ipsum is just one quick way to create one.

@michaelpaye

Thank you for sharing the requested data.

We are looking into the scenario and will share our findings with you soon.

@michaelpaye

We have noticed that the below code snippet consumes around 400 MB of memory and takes about 3.5 minutes with Aspose.PDF for .NET 19.3 in our environment. Would you please ensure you are using the latest version of the API and then share your feedback with us?

var document = new Document(dataDir + "Lorem ipsum dolor sit amet.pdf");
StringBuilder sb = new StringBuilder();
foreach (var page in document.Pages)
{
    var textAbsorber = new TextAbsorber(
        new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.Visit(page);
    sb.Append(textAbsorber.Text);
}
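If you would like to take comparable measurements at your end, something like the following should do (a rough sketch using Stopwatch for elapsed time and the process working set for memory; the data directory is a placeholder):

using System;
using System.Diagnostics;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var dataDir = @"C:\temp\"; // placeholder
var stopwatch = Stopwatch.StartNew();

var document = new Document(dataDir + "Lorem ipsum dolor sit amet.pdf");
var sb = new StringBuilder();
foreach (var page in document.Pages)
{
    var textAbsorber = new TextAbsorber(
        new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.Visit(page);
    sb.Append(textAbsorber.Text);
}

stopwatch.Stop();
Console.WriteLine($"Elapsed: {stopwatch.Elapsed}");
Console.WriteLine($"Working set: {Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024)} MB");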

Hi Farhan

The problem, as mentioned, really shows up with multi-threading - did you try multi-threading your code above?

I'm also surprised that you don't see 400 MB of memory usage for an 18 MB PDF as poor utilisation.

Thanks,
Mike

@michaelpaye

Thank you for elaborating further.

Would you please share your code snippet so that we may replicate it exactly? Moreover, we have shared our findings with you to get on the same page before investigating further. Kindly share the requested code so that we may proceed and assist you efficiently.