Replacing text in large PDFs

gwert · March 31, 2020, 8:04am

Hi there,

We are using Aspose.PDF for .NET version 20.3.0 and the following code:

        var textFragmentAbsorber = new TextFragmentAbsorber("Page ##c# of ##t#")
        {
            TextSearchOptions = {LimitToPageBounds = true}
        };

        document.Pages.Accept(textFragmentAbsorber);
        
        var textFragmentCollection = textFragmentAbsorber.TextFragments;
        
        foreach (var textFragment in textFragmentCollection)
        {
            if (textFragment.Page == null)
                continue;
            
            textFragment.Text =
                textFragment.Text
                    .Replace("##c#", $"{textFragment.Page.Number}")
                    .Replace("##t#", $"{document.Pages.Count}")
                    .PadLeft("Page ##c# of ##t#".Length, ' ');

            textFragment.TextState.HorizontalAlignment = HorizontalAlignment.Right;
        }

to replace the custom page-count marker (current page and total number of pages) in the header of a PDF.
This is a simplified version of a more generic approach in which PDF parts (including this one) are merged into a bigger PDF. The greater goal is to put the custom page-count marker (Page ##c# of ##t#) in all the parts and then use the TextFragmentAbsorber to replace it accordingly.
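
For reference, the string handling on its own (without any Aspose calls) can be sketched as below. The names `PageMarker` and `Format` are hypothetical and only used for illustration; the marker text and the replace-then-pad logic are taken from the code above.

```csharp
using System;

// Hypothetical helper: isolates the marker substitution from the PDF code.
public static class PageMarker
{
    // The marker text as it appears in the PDF header.
    public const string Phrase = "Page ##c# of ##t#";

    // Substitutes both placeholders, then left-pads the result
    // to the width of the original marker.
    public static string Format(int currentPage, int totalPages)
    {
        return Phrase
            .Replace("##c#", currentPage.ToString())
            .Replace("##t#", totalPages.ToString())
            .PadLeft(Phrase.Length, ' ');
    }
}
```

Note that `PadLeft` only pads and never truncates, so a result longer than the marker (for example with five- or six-digit page numbers) comes out wider than the original text.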

The problems we have with this approach are:

It takes roughly 20 seconds to run on the attached input.xls.zip (2.8 MB) file.
The memory usage increases to 3 GB while this process runs.
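
In case it helps with reproducing these numbers, here is a minimal sketch of one way to collect them using only standard .NET APIs (`Stopwatch` and `Process.PeakWorkingSet64`). The `RunStats` class and its `Measure` method are hypothetical names, and this is not necessarily how the figures above were obtained.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for timing an operation and reading peak memory.
public static class RunStats
{
    // Runs the action once and returns the wall-clock time it took,
    // plus the peak working set of the current process after the run.
    public static (TimeSpan Elapsed, long PeakWorkingSetBytes) Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();

        using (var process = Process.GetCurrentProcess())
        {
            process.Refresh();
            return (stopwatch.Elapsed, process.PeakWorkingSet64);
        }
    }
}
```

The replacement loop would be passed in as the action, for example `RunStats.Measure(() => ReplaceMarkers(document))`, where `ReplaceMarkers` stands in for the code shown above. The peak working set covers the whole process lifetime, so it is an upper bound rather than the cost of the replacement alone.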

We then tried using the TextFragmentAbsorber at the page level, with this code:

        var textFragmentAbsorber = new TextFragmentAbsorber(pageCountsPhrase)
        {
            TextSearchOptions = {LimitToPageBounds = true}
        };

        foreach (var page in document.Pages)
        {
            page.Accept(textFragmentAbsorber);

            var textFragmentCollection = textFragmentAbsorber.TextFragments;

            foreach (var textFragment in textFragmentCollection)
            {
                if (textFragment.Page == null)
                    continue;

                textFragment.Text =
                    textFragment.Text
                        .Replace(currentPagePlaceholder, $"{textFragment.Page.Number}")
                        .Replace(countPagesPlaceholder, $"{document.Pages.Count}")
                        .PadLeft(pageCountsPhrase.Length, ' ');

                textFragment.TextState.HorizontalAlignment = HorizontalAlignment.Right;
            }
        }

This alleviates the memory consumption problem, but it doubles the execution time…

We also tried using $p and $P, only to find that:

It is equally time-consuming.
Preparing a PDF for applying the header (the input we've sent you is the output of that process) requires saving the document, and that is when the $p and $P are executed. Maybe we could delay that until the final PDF is built?

We would really appreciate any leads on making the replacement faster and less memory-hungry.

Best regards.

asad.ali · March 31, 2020, 8:58pm

@gwert

The text replacement operation can be time- and memory-consuming, as it is a complicated process. However, we will still investigate the issue in detail and check whether the time cost can be reduced further. For this purpose, we have logged an investigation ticket as PDFNET-47916 in our issue tracking system. Would you please share your environment details, such as RAM and CPU information? We will look into the details of the ticket and keep you informed of its resolution status.

We are sorry for the inconvenience.

gwert · April 1, 2020, 5:39am

Thank you for looking into it!

Is there any other approach to text replacement that we could take here that is less resource-intensive?

Thank you!

asad.ali · April 1, 2020, 6:40pm

@gwert

Regretfully, there is no other recommended approach or method to replace the text. We will share feedback as soon as the ticket has been investigated and resolved. Please give us some time.

We are sorry for the inconvenience.