Extract text and access PDF Layers in C# with Aspose.PDF for .NET - Memory consumption is large

Chris2Stein · February 17, 2020, 8:57am

Hi,

When I extract text from the attached PDF using the following code, the memory allocated by a 64-bit test console application grows to as much as 12 GB and the extraction never finishes.

    static void ExtractTextFromPDF()
    {
        Aspose.Pdf.License lic = new Aspose.Pdf.License();
        lic.SetLicense("Aspose.Total.lic");
        Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(@"c:\@@tmp\Layers_1.pdf");

        TextAbsorber textAbsorber = new TextAbsorber();

        textAbsorber.ExtractionOptions.FormattingMode = TextExtractionOptions.TextFormattingMode.Raw;
        textAbsorber.Visit(pdf);
        // Just the same....
        //pdf.Pages.Accept(textAbsorber);

        Console.WriteLine(textAbsorber.Text);
    }

    static void Main(string[] args)
    {
        try
        {
            ExtractTextFromPDF();
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
            Console.WriteLine("Workingset: {0}", System.Diagnostics.Process.GetCurrentProcess().WorkingSet64);
        }
        Console.ReadLine();
    }

Aspose_Memory.png (24.0 KB)
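
One possible way to narrow the memory problem down is to extract the text one page at a time and log the working set after each page, so that a single problematic page becomes visible. The following is only a minimal sketch, not a confirmed fix or workaround: it assumes the same license file and path as the code above, the class name PerPageExtraction is made up for illustration, and whether per-page extraction actually consumes less memory for this particular file has not been verified.

```csharp
using System;
using System.Diagnostics;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical diagnostic helper (name chosen for illustration only).
static class PerPageExtraction
{
    static void Main()
    {
        // Same license file as in the original snippet (assumed to be present).
        License lic = new License();
        lic.SetLicense("Aspose.Total.lic");

        using (Document pdf = new Document(@"c:\@@tmp\Layers_1.pdf"))
        {
            // The Pages collection in Aspose.PDF is 1-based.
            for (int i = 1; i <= pdf.Pages.Count; i++)
            {
                // A fresh absorber per page, so text from earlier pages is not kept around.
                TextAbsorber absorber = new TextAbsorber();
                absorber.ExtractionOptions.FormattingMode =
                    TextExtractionOptions.TextFormattingMode.Raw;

                pdf.Pages[i].Accept(absorber);

                // Log how much text the page produced and the current working set.
                using (Process current = Process.GetCurrentProcess())
                {
                    Console.WriteLine("Page {0}: {1} characters, working set {2:N0} bytes",
                        i, absorber.Text.Length, current.WorkingSet64);
                }
            }
        }
    }
}
```

If the working set jumps sharply on one particular page, that page number is a useful detail to include in the report.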

Another issue:
I am not able to access the layers of the document using the Page.Layers property. Page.Layers is always null.
I saw this behaviour with every document I checked.

    foreach (Aspose.Pdf.Page page in pdf.Pages)
    {
        if (page.Layers != null) //page.Layers is always null!!
        {
            // do something
        }
    }

I was not able to upload the example PDF (22 MB).

Best Regards
Chris2Stein

asad.ali · February 17, 2020, 6:29pm

@Chris2Stein

Would you kindly upload the sample document to Google Drive or Dropbox and share the link with us? We will test the scenario in our environment and address it accordingly.

Chris2Stein · February 18, 2020, 7:39am

I put one example file on Dropbox:
https://www.dropbox.com/s/hpqygnihudy0a9o/Layers_1.pdf?dl=1

asad.ali · February 18, 2020, 5:05pm

@Chris2Stein

Thanks for sharing the sample PDF.

We were able to reproduce both issues in our environment while testing the scenario with Aspose.PDF for .NET 20.2 and have logged them in our issue tracking system as follows:

PDFNET-47696 - Related to memory consumption
PDFNET-47697 - Related to access layers

We will look into the details of these issues and keep you informed of their resolution status. Please be patient and allow us some time.

We are sorry for the inconvenience.