document.Convert() CONSUMES HUGE MEMORY

ST2YKE2 · March 11, 2024, 5:27pm

I would like to +1 this. We are currently having the exact same issue with almost identical code. We are also seeing it when using PdfContentEditor().ReplaceText(). All of this is done inside using blocks.

```
using (var editor = new PdfContentEditor())
{
    var pdfFilePath = Path.Combine(savePath, $"{page.Number}.pdf");

    HandlePdfRightAlignmentText(page, compileAttachment);

    var newPdfDocument = new Aspose.Pdf.Document();
    newPdfDocument.Pages.Add(page);

    if (compileAttachment.MagicTags != null)
    {
        MagicTag magicTag = compileAttachment.MagicTags.FirstOrDefault(x => x.Tag == "{{item.number}}");
        editor.BindPdf(newPdfDocument);

        //Replace all the matching keys in the text
        editor.ReplaceTextStrategy.ReplaceScope = ReplaceTextStrategy.Scope.ReplaceAll;
        editor.ReplaceText(magicTag.Tag, magicTag.Value ?? "");
        editor.Save(pdfFilePath);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
```

I have been looking for a workaround to this issue for over a week now, to the extent that I’m now saving out each page of the document, attempting to process the pages individually, and then concatenating them back together using PdfFileEditor().Concatenate():

concateFilesPath = Directory.GetFiles(savePath, "*.pdf", SearchOption.TopDirectoryOnly);
var pdfFileEditor = new PdfFileEditor();
pdfFileEditor.Concatenate(concateFilesPath, tempPath);

asad.ali · March 12, 2024, 12:17am

@ST2YKE2

We apologize for the inconvenience you have been facing due to this issue. We have recorded your concerns and will update you as soon as progress is made towards resolving the ticket.

ST2YKE2 · March 12, 2024, 3:56pm

@cpaperless
Did you ever figure out a workaround or another approach for this?

asad.ali · March 12, 2024, 9:56pm

@ST2YKE2

The ticket is currently under investigation. As soon as we make progress towards its resolution, we will update you via this forum thread.

ST2YKE2 · March 14, 2024, 7:46pm

All you technically have to do is create a PDF document, call document.FreeMemory() and document.Dispose(), and watch that the memory is not released at all. You can even set document = null; afterwards and it still doesn’t release the memory. You can also see my incredibly excessive and somewhat dangerous use of GC.Collect() and GC.WaitForPendingFinalizers() everywhere. This is an attempt to force garbage collection, which still does not release the memory.
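For anyone who wants numbers rather than a Task Manager screenshot, the check described above could be wrapped in a small helper like the one below. This is only a sketch: `MemoryProbe`, `ManagedDeltaAfter`, and `WorkingSetBytes` are hypothetical names introduced here, not part of Aspose.PDF, and only standard .NET calls are used. It reports the managed heap delta and the process working set separately, since a working set that stays high while the managed heap drops would point at native allocations or memory the runtime has not yet returned to the OS.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class MemoryProbe
{
    // Runs `work`, forces full collections, and returns the change in
    // managed heap size (in bytes) across the call.
    public static long ManagedDeltaAfter(Action work)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        work();

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect(); // second pass picks up objects released by finalizers

        long after = GC.GetTotalMemory(forceFullCollection: true);
        return after - before;
    }

    // Working set of the current process in bytes. Unlike the managed
    // heap figure, this also covers native allocations.
    public static long WorkingSetBytes()
    {
        using var process = Process.GetCurrentProcess();
        process.Refresh();
        return process.WorkingSet64;
    }
}

// Possible usage with the calls mentioned in this thread
// ("input.pdf" is a placeholder path):
//
// long delta = MemoryProbe.ManagedDeltaAfter(() =>
// {
//     using (var document = new Aspose.Pdf.Document("input.pdf"))
//     {
//         document.FreeMemory();
//     }
// });
// Console.WriteLine($"managed delta: {delta}, working set: {MemoryProbe.WorkingSetBytes()}");
```

A large positive managed delta after the document has been disposed would suggest something still holds references to it; this sketch does not identify what.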

If you really want to see the problem, use the code I sent previously that finds and replaces text and watch that it doesn’t release its memory. This is becoming incredibly urgent for us. At this point I have been looking for any other solution to this problem, and I’m starting to look into other libraries, because this is causing havoc for the customers we service. Just so you know, my code is very messy right now because I have rewritten it over and over again attempting to find a workaround.

In the code below I have added comments so you can see where the problems are. Keep in mind that my code is convoluted: I’m creating a document, saving out its pages, and then attempting to process individual pages to work around this horrible problem.

private async Task ReplaceAttachmentWithSignatures(CompileAttachment compileAttachment)
{
    var savePath = GetSavePath();
    var tempPath = Path.Combine(savePath, $"temp.pdf");
    var flattenedFile = Path.Combine(savePath, "FlattenedPdf.pdf");
    var filePath = Path.Combine(savePath, "ToBeFlattenedPdf.pdf");

    await _azureProvider.DownloadToFileAsync(compileAttachment.AzurePdfPath, filePath);

    try
    {
        //In this using block, tempDocument never releases its memory until, seemingly at random, the GC.Collect() in the finally block takes effect.
        //By random I mean it can run several times, and maybe on the 6th run I finally see the memory go down again.
        using (var tempDocument = new Aspose.Pdf.Document(filePath))
        {
            File.Delete(filePath);

            //pdfForm is the same as above
            using (var pdfForm = new Form())
            {
                ELSLogHelper.InsertInfoLog(ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);

                pdfForm.BindPdf(tempDocument);
                pdfForm.FlattenAllFields();
                pdfForm.Save(flattenedFile);
                ELSLogHelper.InsertInfoLog(ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
            }
        }

        ELSLogHelper.InsertInfoLog(ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
        //Again document is the same as above.
        using (var document = new Aspose.Pdf.Document(flattenedFile))
        {
            File.Delete(flattenedFile);

            foreach (var page in document.Pages)
            {
                var pdfFilePath = Path.Combine(savePath, $"{page.Number}.pdf");
                //newDocument behaves the same as above. I have watched this run over and over again, and despite all my calls to release memory and dispose of it, it just holds on to the memory.
                var newDocument = new Aspose.Pdf.Document();
                newDocument.Pages.Add(page);
                newDocument.Optimize();
                newDocument.Save(pdfFilePath);
                newDocument.FreeMemory();
                newDocument.Dispose();
                newDocument = null;
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
        }

        var filesToProcess = Directory.GetFiles(savePath, "*.pdf", SearchOption.TopDirectoryOnly).OrderBy(x => Convert.ToInt32(Path.GetFileNameWithoutExtension(x))).ToArray();

        ConcatanatePdfFiles(filesToProcess, compileAttachment);
    }
    catch (Exception ex)
    {
        var logManagerModel = new LogManagerModel
        {
            Exception = ex,
            ExceptionData = new Dictionary<string, string>()
            {
                { "Message", $"Failed to replace attachment with magic tag value." },
                { "CallerMemberName", $"{typeof(BaseCompiler).FullName}" },
                { "CallerMethodName", $"{MethodBase.GetCurrentMethod()?.Name}" },
                { "CallerLineNumber", $"{new StackTrace(ex, true).GetFrame(0).GetFileLineNumber()}" }
            }
        };
        _customerCallContext.LogManager.Error(logManagerModel);

        File.Delete(tempPath);

        GC.Collect();
        GC.WaitForPendingFinalizers();

        throw;
    }
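    finally
    {
        // The catch block deletes tempPath, so only upload when the file still exists;
        // otherwise File.ReadAllBytes would throw and mask the original exception.
        if (File.Exists(tempPath))
        {
            await _azureProvider.SaveAzureFileAsync(compileAttachment.AzurePdfPath, File.ReadAllBytes(tempPath));

            File.Delete(tempPath);
        }

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }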
}

private void ProcessFilesInDirectory(string fileToProcess, CompileAttachment compileAttachment)
{
    var savePath = GetSavePath();
    var newPdfDocument = new Aspose.Pdf.Document(fileToProcess);
    var pdfFilePath = Path.Combine(savePath, $"{newPdfDocument.Pages[1].Number}.pdf");
    // MagicTags may be null, so use a null-conditional lookup.
    var magicTag = compileAttachment.MagicTags?.FirstOrDefault(x => x.Tag == "{{item.number}}");

    HandlePdfRightAlignmentText(newPdfDocument.Pages[1], compileAttachment);

    if (magicTag != null)
    {
        var editor = new PdfContentEditor();

        //Replace all the matching keys in the text
        //****THIS IS THE HUGE PROBLEM HERE****
        //This consumes enormous amounts of memory and despite all my attempts below it never gives the memory back.
        //I have a roughly 90 MB file that turns into 11 GB during this process.
        editor.BindPdf(newPdfDocument);
        editor.ReplaceTextStrategy.ReplaceScope = ReplaceTextStrategy.Scope.ReplaceAll;
        editor.ReplaceText(magicTag.Tag, magicTag.Value ?? "");
        editor.Document.Optimize();
        editor.Save(pdfFilePath);
        editor.Document.FreeMemory();
        editor.Document.Dispose();
        editor.Dispose();

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

private void ConcatanatePdfFiles(string[] filesToProcess, CompileAttachment compileAttachment)
{
    var savePath = GetSavePath();
    var tempPath = Path.Combine(savePath, $"temp.pdf");

    try
    {
        foreach (var fileToProcess in filesToProcess)
        {
            ProcessFilesInDirectory(fileToProcess, compileAttachment);
        }

        //pdfFileEditor does the same thing. I watch when this runs and my memory climbs and despite setting pdfFileEditor = null the memory is never given back.
        var pdfFileEditor = new PdfFileEditor();
        pdfFileEditor.CloseConcatenatedStreams = true;
        pdfFileEditor.UseDiskBuffer = true;
        pdfFileEditor.Concatenate(filesToProcess, tempPath);
        pdfFileEditor = null;

        filesToProcess.ForEach(x =>
        {
            File.Delete(x);
        });
    }
    catch (Exception ex)
    {
        var logManagerModel = new LogManagerModel
        {
            Exception = ex,
            ExceptionData = new Dictionary<string, string>()
            {
                { "Message", $"Failed to process each page of {compileAttachment.Name}; FileId: {compileAttachment.DocumentId}." },
                { "CallerMemberName", $"{typeof(BaseCompiler).FullName}" },
                { "CallerMethodName", $"{MethodBase.GetCurrentMethod()?.Name}" },
                { "CallerLineNumber", $"{new StackTrace(ex, true).GetFrame(0).GetFileLineNumber()}" }
            }
        };
        _customerCallContext.LogManager.Error(logManagerModel);

        filesToProcess.ForEach(x =>
        {
            File.Delete(x);
        });

        File.Delete(tempPath);

        GC.Collect();
        GC.WaitForPendingFinalizers();

        throw;
    }
    finally
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

public void HandlePdfRightAlignmentText(Aspose.Pdf.Page page, CompileAttachment compileAttachment)
{
    // Create a TextFragmentAbsorber to find all instances of the input search phrase
    // Regex pattern like [[! any text {{magictag}} !]]
    Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"\[+\[+\!+[^\[\]]+\!+\]+\]");

    //paragraph alignment
    textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = Aspose.Pdf.Text.TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;

    //enabling regex search
    Aspose.Pdf.Text.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;

    // Accept the absorber for this page
    page.Accept(textFragmentAbsorber);

    //converting the magic tags to key value pair dic
    Dictionary<string, string> dic = new Dictionary<string, string>();
    if (compileAttachment.MagicTags != null)
    {
        dic = compileAttachment.MagicTags.DistinctBy(x => new { x.Tag, x.Value }).ToDictionary(x => x.Tag, y => y.Value);
    }

    // Get the extracted text fragments
    Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

    // Loop through the fragments
    foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
    {
        // Replace the [[! !]] from the text
        var replacedText = textFragment.Text.Replace(@"[[!", "").Replace(@"!]]", "");
        textFragment.HorizontalAlignment = Aspose.Pdf.HorizontalAlignment.Right;
        textFragment.TextState.Underline = false;
        foreach (var k in dic.Keys.ToList())
        {
            if (replacedText.Contains(k))
            {
                var splittedText = replacedText.Split(new string[] { k }, StringSplitOptions.None);
                var composedValue = string.Join("", splittedText) + "" + dic[k];
                textFragment.TextState.HorizontalAlignment = Aspose.Pdf.HorizontalAlignment.Right;

                var isUnderLine = false;
                if (composedValue.Contains("u=1"))
                {
                    composedValue = composedValue.Replace("u=1", "");
                    isUnderLine = true;
                }

                var indent = CustomerConfig.For(_customerCallContext).PdfXIndent;
                double pdfXIndent = 45;
                if (!string.IsNullOrEmpty(indent))
                {
                    pdfXIndent = Convert.ToDouble(indent);
                }

                textFragment.Text = composedValue;
                textFragment.TextState.Underline = isUnderLine;
                textFragment.TextState.HorizontalAlignment = Aspose.Pdf.HorizontalAlignment.Right;
                textFragment.Position = new Aspose.Pdf.Text.Position(page.Rect.LLX +
                (page.Rect.Width - textFragment.Rectangle.Width - pdfXIndent), textFragment.Position.YIndent);
                break;
            }
        }
    }
}

Also, when you call document.Convert() it CONSUMES HUGE MEMORY again. I took a screenshot so you can see how much memory this application is consuming when I call document.Convert().

image.png (157.4 KB)

Here is the file that causes HUGE memory consumption even when I’m processing a single page at a time and the editor won’t release.

022024-Ordinance-24-attachment.pdf

Let me know if you need more examples. I’d be happy to provide anything additional to help get this resolved; it is killing us right now. I have been stuck on this issue for over two weeks looking for a solution or workaround.

ST2YKE2 · March 15, 2024, 12:28am

He is not a part of my organization.

asad.ali · March 15, 2024, 7:15pm

@ST2YKE2

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56803

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

ST2YKE2 · March 15, 2024, 7:31pm

@asad.ali
I appreciate it. Will you still post updates here as progress is made?

asad.ali · March 15, 2024, 9:32pm

@ST2YKE2

Yes, we will keep you posted on the status of the issue within this forum thread. As soon as it is resolved, we will send you a notification here as well.

kaznetc · May 15, 2024, 12:03pm

Hello,
Is there any news on this one? I would like to bump this topic. We too are affected by huge memory consumption when calling the Convert() method.
We combine approximately 140 PDF documents (total size roughly 150-200 MB), and then call Convert with the option new PdfFormatConversionOptions(PdfFormat.PDF_UA_1, ConvertErrorAction.None);.
Memory usage exceeded 100 GB. This happened on a production system, so I can’t share the documents.

asad.ali · May 15, 2024, 9:28pm

@kaznetc

We are afraid that there is no news about ticket resolution yet. Performance-related issues usually take a certain amount of time to be rectified. Your concerns have been recorded and issue priority has been revived as well. As soon as we have some updates, we will let you know. Please spare us some time.

We are sorry for the inconvenience.

kaznetc · May 17, 2024, 8:05am

@asad.ali
We are using the Convert() method to convert PDFs to PDF/UA-1 format to ensure the documents are supported by screen readers. Is there perhaps another way to achieve this without using this method?

asad.ali · May 17, 2024, 6:57pm

@kaznetc

There is no other method except Convert to convert a PDF into PDF/UA. Your concerns have been recorded along with the ticket and we will surely inform you once we make some progress towards ticket resolution.