Redact certain text from PDF using Aspose.PDF for .NET - CPU usage and memory consumption

Hi,

We are looking to redact upwards of thousands of objects on some documents. I have been trying various ways to redact with Aspose.PDF and noticed that memory use and execution time increase hugely as more redaction annotations are applied. For example, applying 400 redactions to a very small file takes ~1.7 GB of memory, whereas applying 800 takes 4x as long as applying 400 and takes upwards of 6.8 GB of memory. Is there a way to redact more efficiently? Or is there a way to clean up the document as redactions are applied so that the execution time and memory spike doesn't happen?

I tried re-opening the file every X redactions and then applying new redactions, but the time and memory spike per redaction still seemed to be there.

Repro code and file below (although this is happening on all files we’ve tried so far):

using Aspose.Pdf;
using Aspose.Pdf.Annotations;

using (var document = new Document("SimpleText.pdf"))
{
	var page = document.Pages[1];

	for (var j = 0; j < 1_000; j++)
	{
		var annotation = new RedactionAnnotation(page, new Rectangle(j, j, j + 1, j + 1))
		{
			FillColor = Color.Black,
			Color = Color.Black,
			BorderColor = Color.Black
		};

		page.Annotations.Add(annotation);
		annotation.Redact();
	}
}

Looped (still has memory spike issue, just slightly more manageable):

using (var document = new Document("SimpleText.pdf"))
{
	var page = document.Pages[1];

	for (var i = 0; i < 10; i++)
	{
		for (var j = 0; j < 100; j++)
		{
			var offset = (100 * i) + j; // same coordinates as the single-loop repro
			var annotation = new RedactionAnnotation(page, new Rectangle(offset, offset, offset + 1, offset + 1))
			{
				FillColor = Color.Black,
				Color = Color.Black,
				BorderColor = Color.Black
			};

			page.Annotations.Add(annotation);
			annotation.Redact();
		}
	}
}

SimpleText.zip (36.3 KB)

On a related note, there also seems to be an issue with disposing of the file properly after redactions have been made. Disposing the Aspose.Pdf.Document did not completely clear up memory usage within our .NET application.

Thanks!

@bvk,

I have observed the issue you mentioned and have logged it as PDFNET-47766 in our issue tracking system. We will look further into the details of the issue and keep you posted on the status of its resolution. Please be patient and spare us a little time.

We are sorry for the inconvenience.

Hi,

Has there been any update to this issue?

I have had time to do more performance testing on the implementation in general and have noticed some approximate baselines: for every redaction on a page in a PDF, memory usage seems to go up ~100 KB per redaction. So using annotation.Redact() on a page 200 times takes (on average) ~2 MB/call (totaling 400 MB), but using it on a page 300 times takes ~3 MB/call (900 MB), and 400 times takes ~4 MB/call (1600 MB). This is still a problem for us because, in the cases where we call it 300 times on a page, making the redactions alone costs 900 MB/page (not counting the document overhead). Because resources are not released until the document is closed, if that happens on even a 10-page document, Aspose.PDF will use 9 GB(!!) of memory.
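
As a rough sanity check on the figures above, the three totals (400 MB, 900 MB, 1600 MB) all fit a quadratic curve of about 0.01 MB times the square of the number of calls. The sketch below only illustrates that arithmetic. It assumes the growth really is quadratic, and `RedactionMemoryModel`, `EstimateTotalMb` and the fitted divisor of 100 are hypothetical names and values for this example, not part of Aspose.PDF.

```csharp
using System;

public static class RedactionMemoryModel
{
    // Assumption: total memory grows with the square of the number of
    // Redact() calls on one page. The divisor is fitted from the reported
    // figure of 400 MB for 200 calls: 200 * 200 / 400 = 100.
    public static double EstimateTotalMb(int calls)
    {
        return (double)calls * calls / 100.0;
    }

    // Average cost per call implied by the same model, in MB.
    public static double EstimateMbPerCall(int calls)
    {
        return EstimateTotalMb(calls) / calls;
    }
}
```

Under this assumption, `EstimateTotalMb(300)` gives 900 MB per page, so ten such pages come to about 9000 MB, and `EstimateTotalMb(800)` gives 6400 MB, roughly in line with the 6.8 GB seen in the first repro. If the model holds, doubling the redactions on a page quadruples the memory, so spreading the same number of redactions over more pages should cost far less than concentrating them on one.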

Thanks

@bvk

Regretfully, the ticket is not yet resolved. It will be investigated and resolved on a first-come, first-served basis. However, we have logged your concerns along with the ticket and will consider them during analysis. Please be patient and spare us some time.

We are sorry for the inconvenience.