Memory leak when extracting text

Aaron2 · July 13, 2015, 7:22pm

Using .NET Framework 4.5 on Windows 7 with Visual Studio 2012 and the most recent version of the Aspose DLLs.

RAM use seems to increase continuously while extracting text from a very large PDF page by page. Ideally each page would take a constant amount of RAM and the memory profile would be flat or nearly flat over time. It seems that Aspose.Pdf is holding on to something internally, so reserved memory slowly grows with each page that is processed. Running the code below, we produced this memory profile with ANTS Memory Profiler 8.6. Note that garbage collection is explicitly triggered after each page. Running the same code with other tools (iText) does produce a flat profile.

Here’s the C# code we used:

public static IEnumerable getFileContents(string filePath)
{
    if (File.Exists(filePath))
    {
        Document myDoc = new Document(filePath);
        string contents = "";

        // create text device
        TextDevice textDevice;
        // set text extraction options - set text extraction mode (Raw or Pure)
        TextExtractionOptions textExtOptions = new TextExtractionOptions(
            TextExtractionOptions.TextFormattingMode.Raw);

        foreach (Page pdfPage in myDoc.Pages)
        {
            using (MemoryStream textStream = new MemoryStream())
            {
                textDevice = new TextDevice();
                textDevice.ExtractionOptions = textExtOptions;

                // convert a particular page and save text to the stream
                textDevice.Process(pdfPage, textStream);

                // close memory stream
                textStream.Close();

                contents = Encoding.Unicode.GetString(textStream.ToArray());
            }
            yield return contents;
        }
    }
}

static void Main(string[] args)
{
    // set the license for Aspose. Without this it will not fully extract text!
    AsposeTotalLicense asposeLic = new AsposeTotalLicense();
    asposeLic.ApplyLicensePdf();

    int i = 0;
    foreach (String s in getFileContents(@"C:\Users\ADS\Desktop\Aspose Tests\Aus.pdf"))
    {
        Console.WriteLine(i++);
        GC.Collect();
    }
}

Here is a link to the file in question. It’s so big simply to give the problem enough time to materialize. We are aware that this is a scanned PDF that contains no text; it’s here only to show the increasing trend in RAM use over time.

tilal.ahmad · July 14, 2015, 11:12pm

Hi Aaron,

Thanks for your inquiry. We have logged an investigation ticket PDFNEWNET-39026 in our issue tracking system to investigate and rectify the issue. We will keep you updated about the issue resolution progress within this forum thread.

We are sorry for the inconvenience caused.

Best Regards,

Aaron2 · July 17, 2015, 10:20am

Hi Tilal!

Any updates on this issue? It’s really holding us back.

tilal.ahmad · July 17, 2015, 7:33pm

Hi Aaron,

We are sorry for the inconvenience. I am afraid we noticed the issue only recently and it is still pending investigation. We will notify you as soon as we have made significant progress towards resolving it.

Best Regards,

Aaron2 · August 11, 2015, 1:44pm

Any updates yet?

tilal.ahmad · August 12, 2015, 1:54am

Hi Aaron,

Thanks for your inquiry. I am afraid the issue is still in the queue, pending investigation. The product team is currently busy resolving other issues that were reported earlier. We will notify you as soon as we have made significant progress towards resolving it.

We are sorry for the inconvenience caused.

Best Regards,