Replacing text in a PDF document takes too much time and consumes too much CPU

jcgm2 · August 23, 2022, 9:54am

We are using Aspose.Pdf in .NET to replace sensitive text in PDF files. We want to replace every match of a set of regular expressions, on all pages of the PDF, with ###.

If the PDF file is small and contains few matches, everything is fine. But if the PDF is large (many pages and multiple regular expressions to replace), the replacement takes too long and consumes too much CPU.

This is our code using the TextFragmentAbsorber and TextFragmentCollection classes:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public byte[] ReplaceSensitiveText(byte[] docPdf, List<string> regularExpressions)
{
    using MemoryStream ms = new MemoryStream(docPdf);
    using Document pdfDocument = new Document(ms);

    foreach (var item in regularExpressions)
    {
        // Each pattern arrives Base64-encoded; decode it to the regex text.
        string pattern = Encoding.UTF8.GetString(Convert.FromBase64String(item));

        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(pattern);

        // true = treat the search phrase as a regular expression.
        TextSearchOptions textSearchOptions = new TextSearchOptions(true);
        textFragmentAbsorber.TextSearchOptions = textSearchOptions;

        foreach (var page in pdfDocument.Pages)
        {
            page.Accept(textFragmentAbsorber);

            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

            // Mask every match found so far.
            foreach (TextFragment textFragment in textFragmentCollection)
            {
                textFragment.Text = "###";
            }
        }
    }

    using MemoryStream mso = new MemoryStream();
    pdfDocument.Save(mso);
    return mso.ToArray();
}

We have also tried another solution using the PdfContentEditor class. This approach is faster, but it consumes too much memory.

The code is:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Aspose.Pdf.Facades;

public byte[] ReplaceSensitiveText(byte[] docPdf, List<string> regularExpressions)
{
    using var ms = new MemoryStream(docPdf);
    using var mso = new MemoryStream();
    using PdfContentEditor pdfContent = new PdfContentEditor();

    pdfContent.BindPdf(ms);

    foreach (var item in regularExpressions)
    {
        // Treat the search text as a regex and replace every occurrence.
        pdfContent.ReplaceTextStrategy = new ReplaceTextStrategy()
        {
            IsRegularExpressionUsed = true,
            ReplaceScope = ReplaceTextStrategy.Scope.ReplaceAll
        };

        // Each pattern arrives Base64-encoded; decode it before searching.
        pdfContent.ReplaceText(Encoding.UTF8.GetString(Convert.FromBase64String(item)), "###");
    }

    pdfContent.Save(mso);
    pdfContent.Close();

    return mso.ToArray();
}

We have recently licensed the latest version of Aspose.Total.

Could you tell us the fastest and most efficient way to replace text in a PDF document?

tahir.manzoor · August 23, 2022, 11:00am

@jcgm2

To ensure a timely and accurate response, please attach the following resources here for testing:

Your input PDF file.
The expected output PDF file that shows the desired behavior.
A standalone console application (source code without compilation errors) that helps us reproduce your problem on our end.

As soon as you have these resources ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

jcgm2 · August 24, 2022, 12:45pm

AsposeReplaceTextProblem.zip (620.2 KB)

As I said before, we have a license for the latest version of Aspose.Total, so that is the version we are working with.
Here is the zip with the data and two console applications.

tahir.manzoor · August 24, 2022, 7:18pm

@jcgm2

We are working on your query and will get back to you soon.

tahir.manzoor · August 25, 2022, 1:27pm

@jcgm2

We have logged this problem in our issue tracking system as PDFNET-52392. You will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.