Hi
There appear to be memory management issues in the PDF product when extracting text from PDFs.
The code we were using was from your samples:
var document = new Document("sample.pdf"); // placeholder path for the attached sample document
var textAbsorber =
    new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
textAbsorber.Visit(document);
var text = textAbsorber.Text;
In our tests we were able to exhaust all of a test server's memory (24 GB) in roughly 7 minutes, running multi-threaded code against many copies of the attached sample document.
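For reference, the multi-threaded test was along the lines of the sketch below; the file path, iteration count, and use of Parallel.For are illustrative placeholders rather than our exact harness (and it assumes the Aspose.Pdf and Aspose.Pdf.Text namespaces):

using System.Threading.Tasks;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Repeatedly run the sample extraction in parallel against copies of the document.
Parallel.For(0, 1000, n =>
{
    using (var document = new Document("sample.pdf")) // placeholder path
    {
        var textAbsorber =
            new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
        textAbsorber.Visit(document);
        var text = textAbsorber.Text; // result discarded; we are only exercising extraction
    }
});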
We then adjusted our code to the below:
var sb = new StringBuilder();
int i = 0;
foreach (var page in document.Pages)
{
    // Absorb one page at a time rather than the whole document.
    var textAbsorber =
        new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    textAbsorber.Visit(page);
    sb.Append(textAbsorber.Text);

    // Periodically force a collection so the absorber's internals are released.
    if (i++ % 100 == 0)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
Using the new per-page method, we could see memory being consumed but then released.
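That "consumed then released" pattern was visible in the server's monitoring; a simple probe along these lines (illustrative, not our exact instrumentation) shows the same thing from inside the process:

using System;
using System.Diagnostics;

// Hypothetical helper: log managed-heap size and process working set,
// e.g. once per batch of pages, to watch memory rise and fall.
static void LogMemory()
{
    long heapMb = GC.GetTotalMemory(false) / (1024 * 1024);
    long workingSetMb = Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024);
    Console.WriteLine($"GC heap: {heapMb} MB, working set: {workingSetMb} MB");
}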
I noticed a number of other articles about this on the blog, but couldn't see any solutions (like the one above). I also saw some responses suggesting that this had been resolved, which I don't believe it has been.
Thanks