Memory issue with TextFragmentAbsorber

cyginfo · April 18, 2022, 1:47pm

We are using the code below to replace matched strings with blanks in a PDF file. It consumes a lot of memory, mainly in the doc.Pages.Accept(absorber) call. Is there another option that uses less memory? We are using version 21.12.

string pattern = "HORIZON/WINDOW|Revolutions start|ULAGE";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);

// Treat the search phrase as a regular expression.
var textSearchOptions = new TextSearchOptions(true);
TextFragmentAbsorber absorber = new TextFragmentAbsorber(regex);
absorber.TextSearchOptions = textSearchOptions;
absorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

// Search every page of the document in a single call.
doc.Pages.Accept(absorber);

// Blank out each matched fragment.
TextFragmentCollection textFragmentCollection = absorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = string.Empty;
}

asad.ali · April 18, 2022, 2:30pm

@cyginfo

Could you please share the sample PDF document as well for our reference? We will test the scenario in our environment and address it accordingly.

cyginfo · April 18, 2022, 2:50pm

I will arrange for the document to be sent to you by tomorrow, as the dev team has left for the day. Just to add: we believe this issue is not specific to one document but occurs with any type of document.

asad.ali · April 18, 2022, 8:55pm

@cyginfo

To prevent high memory consumption, you can search and get the text one page at a time, as below:

foreach (Page page in doc.Pages)
{
    page.Accept(absorber);
}

However, if it still does not help, please share a sample file for our reference so that we can further test the scenario in our environment and address it accordingly.
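To check whether the page-level loop actually lowers memory use compared with doc.Pages.Accept(absorber), it may help to measure both variants the same way. The following is only a minimal sketch using standard .NET calls; the MemoryProbe class and its method names are hypothetical helpers, not part of Aspose.PDF, and the managed-heap delta is a rough indicator rather than a precise figure.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for rough before/after memory comparisons.
public static class MemoryProbe
{
    // Runs the given work and returns the change in managed heap size, in bytes.
    // A full collection is forced before the work so the baseline is stable;
    // the reading afterwards may still include garbage that has not been collected,
    // and can even be negative if a collection ran during the work.
    public static long ManagedBytesUsedBy(Action work)
    {
        long before = GC.GetTotalMemory(true);
        work();
        long after = GC.GetTotalMemory(false);
        return after - before;
    }

    // Returns the peak working set of the current process, in bytes.
    // This also covers unmanaged allocations, which the managed heap figure misses.
    public static long PeakWorkingSetBytes()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return current.PeakWorkingSet64;
        }
    }
}
```

For example, MemoryProbe.ManagedBytesUsedBy(() => doc.Pages.Accept(absorber)) could be compared against the same call wrapped around the page-by-page loop, each in a fresh process so the two runs do not influence one another.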

cyginfo · April 19, 2022, 4:11am

Please download the document (20 MB.pdf) from the Google Drive link below:

https://drive.google.com/file/d/15VU36fVI2SQcbkGnDp3aftst8wbIdvsd/view?usp=sharing

Let me know if you have any difficulty downloading the file.

asad.ali · April 19, 2022, 3:35pm

@cyginfo

Are you sure that the regular expression you shared with us is able to extract the text from this PDF? We tested it in our environment and it did not find any text. Furthermore, we could not observe the memory consumption issue while testing with the code below and version 22.3 of the API:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(dataDir + @"20 MB.pdf");

string pattern = "HORIZON/WINDOW|Revolutions start|ULAGE";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
foreach (var page in pdfDocument.Pages)
{
    // Use a fresh absorber for each page.
    var textSearchOptions = new TextSearchOptions(true);
    TextFragmentAbsorber absorber = new TextFragmentAbsorber(regex);
    absorber.TextSearchOptions = textSearchOptions;
    absorber.TextReplaceOptions = new TextReplaceOptions(TextReplaceOptions.ReplaceAdjustment.None);

    // Search only the current page.
    page.Accept(absorber);
    TextFragmentCollection textFragmentCollection = absorber.TextFragments;
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = string.Empty;
    }
}
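On the question of whether the pattern finds anything, one way to narrow it down is to run the same regular expression against plain strings first, outside the PDF API. This is only a sketch: the PatternCheck class and the sample strings mentioned below are made up for illustration, and text as stored in a PDF may differ from what is visible on the page (for example in spacing), so a match here does not guarantee a match in the document.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that applies the pattern from this thread to plain text.
public static class PatternCheck
{
    public const string Pattern = "HORIZON/WINDOW|Revolutions start|ULAGE";

    // Returns every substring of the input that the pattern matches, ignoring case.
    public static List<string> FindMatches(string text)
    {
        var regex = new Regex(Pattern, RegexOptions.IgnoreCase);
        var found = new List<string>();
        foreach (Match match in regex.Matches(text))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

With a made-up input such as "Revolutions Start at horizon/window", FindMatches returns two matches, "Revolutions Start" and "horizon/window". A differently spelled word such as "ULLAGE" returns no match, because each alternative is matched literally apart from case.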