Memory leak in PDF .NET

Holger_Schulze · February 25, 2019, 12:12pm

Hello,

I am trying to search for a set of words in PDF files. I want to identify the files which contain at least one of my search terms. When I run the program, the memory used climbs to 16 GB in Visual Studio. My central function is:

    public bool CheckPdf(string fileItem)
    {
        try
        {
            using (Document pdfDocument = new Document(fileItem))
            {
                foreach (Page page in pdfDocument.Pages)
                {
                    TextAbsorber textAbsorber =
                        new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
                    page.Accept(textAbsorber);
                    foreach (string suchBegriff in Suchbegriffe)
                    {
                        if (textAbsorber.Text.Contains(suchBegriff))
                        {
                            pdfDocument.Dispose();
                            return true;
                        }
                    }
                }
                pdfDocument.Dispose();
                return (false);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
        return false;
    }

I have some 3,000 files to check. What can I do to reduce memory usage?

asad.ali · February 25, 2019, 7:18pm

@Holger_Schulze

Thanks for contacting support.

You may try calling the page.Dispose() and pdfDocument.FreeMemory() methods to clear the memory. If the issue persists, please share some information about your working environment, such as OS name and version, installed system memory, and application type. Also, please share the average size of the files you are trying to process.

It would be helpful if you can share some sample file(s) with us. We will further test the scenario in our environment and address it accordingly.

Holger_Schulze · February 26, 2019, 12:10pm

Thanks for the quick response. The changes you suggested did not help, but I found a solution which works fine for me. I wrote my own class, which implements IDisposable. In Dispose I inserted:

    GC.Collect();
    GC.WaitForPendingFinalizers();

That did the trick.

Pdfsearcher.zip (693 Bytes)

I have attached the class in case someone is interested.
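
The workaround described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the attached class: the names GcScope and BatchRunner are hypothetical, nothing here is part of the Aspose.PDF API, and the per-file work (for example opening and searching a PDF) is stood in for by the `work` delegate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.PDF): forces a garbage
// collection when the scope is disposed, as in the workaround above.
public sealed class GcScope : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose()
    {
        if (IsDisposed) return;        // make repeated Dispose calls harmless
        IsDisposed = true;
        GC.Collect();                  // reclaim unreachable objects now
        GC.WaitForPendingFinalizers(); // wait until pending finalizers have run
    }
}

// Hypothetical driver: runs the per-item work inside its own GcScope.
public static class BatchRunner
{
    // Returns the items for which `work` returned true.
    public static List<string> Run(IEnumerable<string> items, Func<string, bool> work)
    {
        var hits = new List<string>();
        foreach (string item in items)
        {
            using (new GcScope())
            {
                if (work(item)) hits.Add(item);
            }
        }
        return hits;
    }
}
```

A call such as `BatchRunner.Run(fileNames, name => CheckPdf(name))` would then trigger one collection per file. Forcing collections like this is usually discouraged in .NET because it costs CPU time, so it is best treated as a workaround to measure rather than a general recommendation.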

asad.ali · February 26, 2019, 6:13pm

@Holger_Schulze

Thanks for your acknowledgement.

It is good to hear that you managed to resolve your issue. The shared class will certainly help others facing a similar issue. Please keep using our API, and if you need any further assistance, feel free to let us know.

LeeWheeler · October 5, 2020, 2:34pm

Hello, I am having the same issues as you have had. I cannot access the ZIP file you placed out here. It is locked. Can you send it to me via email possibly? Post back if you can and I will provide my email address. Thank you

asad.ali · October 5, 2020, 8:04pm

@LeeWheeler

We have sent you a private message with requested information. You can check it in your inbox.

dahl.silmer · September 24, 2021, 1:02pm

We are experiencing a similar issue. Could you please send me the shared class as well?

asad.ali · September 27, 2021, 6:07am

@dahl.silmer

A private message has been sent to you including the requested information.

markus.hald · November 25, 2021, 12:51pm

I am experiencing the same issue. Could you share the class with me as well?

Thank you in advance!

Holger_Schulze · November 25, 2021, 2:19pm

This is the code which works for me: (I am just testing how to apply new formatting to old files)

Hope it helps.

    List<GroupShape> shapes = GetGroupShapesFromDocument(Document2);
    TestInsertShapes(shapes);


    /// <summary>
    /// Test the copying of GroupShapes 
    /// </summary>
    /// <param name="shapes"></param>
    private void TestInsertShapes(List<GroupShape> shapes)
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        foreach (GroupShape gs in shapes)
        {
            // Insert after last Paragraph
            // Could be any other Paragraph
            InsertAtParagraph(doc, doc.FirstSection.Body.LastParagraph, gs, prepend: false);
            // Just to have enough space for the Groupshapes
            builder.Writeln("Line 1");
            builder.Writeln("Line 2");
            builder.Writeln("Line 3");
            builder.Writeln("Line 4");
            builder.Writeln("Line 5");
        }
        try
        {
            doc.Save("TestCopyComplete.docx");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
    }

    /// <summary>
    /// Get all GroupShapes from Document
    /// </summary>
    /// <param name="doc"></param>
    /// <returns></returns>
    private List<GroupShape> GetGroupShapesFromDocument(Document doc)
    {
        List<GroupShape> shapes = new List<GroupShape>();
        foreach (GroupShape shape in doc.GetChildNodes(NodeType.GroupShape, true))
        {
            // This test was necessary because GroupShapes within GroupShapes
            // caused an error on saving 
            if (shape.PreviousSibling == null ||
                (shape.PreviousSibling.NodeType != NodeType.GroupShape && shape.PreviousSibling.NodeType != NodeType.Shape))
            {
                shapes.Add(shape);
            }

        }
        return shapes;
    }

    public void InsertAtParagraph(Document doc, Paragraph p, GroupShape shape, bool prepend = false)
    {
        NodeImporter importer = new NodeImporter(shape.Document, doc, ImportFormatMode.KeepSourceFormatting);
        GroupShape newNode = (GroupShape) importer.ImportNode(shape, true);
        // Insert the imported shape at the start or at the end of the paragraph
        if (prepend)
        {
            p.PrependChild(newNode);
        }
        else
        {
            p.AppendChild(newNode);
        }
    }

markus.hald · November 29, 2021, 9:23pm

Hey Holger,

I'm doing HTML to PDF conversion, and I was wondering if you could help me understand how I might apply the code you sent here to that. It seems that I am running into the same issue that you described in '19, where the document is never disposed.

Thank you in advance!

Holger_Schulze · November 30, 2021, 7:33am

Hello Markus,

I am posting the code of the class, including how I use it.
In short, I open many PDF files and search them for certain words. That caused the software to crash because I ran out of memory. So I built a class that wraps the opening and closing of the PDF file. I tried to translate the code from German to English. I hope it helps.

    // This part is used in a loop through many directories,
    // inside a parallel loop:
    // Parallel.ForEach(requests, new ParallelOptions { MaxDegreeOfParallelism = 4 }, request =>
    //
    // Search in all files in a directory
    DirectoryInfo di = new DirectoryInfo(path);
    foreach (var fi in di.GetFiles())
    {
        // Create the searcher;
        // using will dispose the PDFSearcher after use
        using (var searcher = new PDFSearcher())
        {
            var r = searcher.CheckPdfList(fi.FullName, ListOfSearchItems, Invert);
            if (r.Count(l => l.Value.Count > 0) > 0)
            {
                result.Add(request);
                if (!rs.TryAdd(request.ID, r))
                {
                }
            }
        }
    }

    // The PDFSearcher class
    using Aspose.Pdf;
    using Aspose.Pdf.Text;
    using System;
    using System.Collections.Generic;

    namespace AnalyseCallisto.FrameWork
    {
        class PDFSearcher : IDisposable
        {
            // Searches a PDF file for at least one of the words in Searchitems.
            // invert flips the result; a file that cannot be read still returns false.
            public bool CheckPdf(string fileItem, List<string> Searchitems, bool invert)
            {
                try
                {
                    using (Document pdfDocument = new Document(fileItem))
                    {
                        foreach (Page page in pdfDocument.Pages)
                        {
                            TextAbsorber textAbsorber =
                                new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
                            page.Accept(textAbsorber);
                            // Release the page as soon as its text has been extracted
                            page.FreeMemory();
                            page.Dispose();
                            foreach (string searchitem in Searchitems)
                            {
                                if (textAbsorber.Text.Contains(searchitem))
                                {
                                    pdfDocument.FreeMemory();
                                    pdfDocument.Dispose();
                                    return !invert;
                                }
                            }
                        }
                        pdfDocument.FreeMemory();
                        pdfDocument.Dispose();
                        return invert;
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex);
                }
                return false;
            }

            // Forces a garbage collection when the searcher is disposed
            public void Dispose()
            {
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
        }
    }