High memory consumption when reading text with TextAbsorber

cpaperless · July 1, 2024, 5:20am

Hi,

We are trying to read text from a large PDF and are seeing high memory consumption, which eventually causes the system to run out of memory.

We use the code below to extract text page by page:

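 using System.Collections.Generic;
 using System.IO;
 using Aspose.Pdf;
 using Aspose.Pdf.Text;

 public void PerformanceTest()
 {
     // Apply the license; the stream is disposed once it has been read.
     using (var licenseStream = File.OpenRead("D:\\AsposeTotalNET.lic"))
     {
         License license = new License();
         license.SetLicense(licenseStream);
     }

     // Holds the extracted text of every page for the whole run.
     List<string> pageTextList = new List<string>();

     using (var file = File.OpenRead("D:\\large 30k.pdf"))
     using (Aspose.Pdf.Document doc = new Aspose.Pdf.Document(file))
     {
         foreach (Page page in doc.Pages)
         {
             // A new absorber per page, extracting in raw formatting mode.
             TextAbsorber textAbsorber = new TextAbsorber();
             textAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
             page.Accept(textAbsorber);
             pageTextList.Add(textAbsorber.Text);
         }
     }
 }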

Aspose memory issue.png (26.4 KB)

The memory profiler shows about 4.6 GB of memory consumed while reading the text from all the pages, and the figure varies from run to run.
Please find the sample document link

We tried version 23.12.0 and the latest version, 24.6.0, and observed the same behavior in both.

asad.ali · July 1, 2024, 7:59pm

@cpaperless

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57559

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

aspose.notifier · August 19, 2024, 2:55pm

The issues you have found earlier (filed as PDFNET-57559) have been fixed in Aspose.PDF for .NET 24.8.
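
For readers who cannot move to 24.8 right away, one way to lower the peak memory of the loop in the original post may be to stop holding every page's text in a list and to release per-page caches as the loop advances. The following is only a sketch under assumptions: the class and method names (PageTextDump, ExtractToFile) and the output path are hypothetical, and Page.FreeMemory() is assumed to be available in the installed Aspose.PDF version. It has not been measured against the sample document from this thread, and it is not the fix delivered under PDFNET-57559.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper, not part of the original thread.
public static class PageTextDump
{
    // Writes the raw text of each page to outputPath, one page after another,
    // instead of accumulating all page texts in memory.
    public static void ExtractToFile(string pdfPath, string outputPath)
    {
        using (var file = File.OpenRead(pdfPath))
        using (var doc = new Document(file))
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (Page page in doc.Pages)
            {
                // Same extraction settings as in the original post.
                var absorber = new TextAbsorber();
                absorber.ExtractionOptions =
                    new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
                page.Accept(absorber);

                // Flush this page's text to disk so it can be collected.
                writer.WriteLine(absorber.Text);

                // Assumption: Page.FreeMemory() clears data cached for the page.
                // Check that it exists in your Aspose.PDF version before relying on it.
                page.FreeMemory();
            }
        }
    }
}
```

A call such as PageTextDump.ExtractToFile("D:\\large 30k.pdf", "D:\\large 30k.txt") would then replace the list-building loop; the output path here is only an example. Whether this changes the profiler numbers reported above depends on where the library itself holds memory, so it is worth re-profiling before and after.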