Heavy memory consumption on Document.Find()

Hello,

I am experiencing problems with Document.Pages.Accept(TextFragmentAbsorber):

The error occurs with the PDF document I have attached to this post. The code works with many other PDF files. The error can be reproduced on multiple machines and in simple unit tests.

When using this code:

_doc = new Document(fileName);
var textFragmentAbsorber = new TextFragmentAbsorber($"(?i){term}");
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
_doc.Pages.Accept(textFragmentAbsorber);

with term = 'storytelling'

CPU usage rises to 90%+ and memory usage constantly rises to multiple GB until all memory is consumed.

1. Why is this behavior occurring?
2. How can I fix it?
3. Is there a way to detect and abort the process when the faulty behavior occurs?

Thank you very much for your help!
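
On question 3, one possible approach (a sketch only, not something confirmed by anyone in this thread) is to run the search in a separate worker process and have the caller kill that process once it exceeds a time or memory budget. A blocking library call cannot be aborted safely from inside the same process, which is why the sketch uses a child process. `GuardedRun`, `ShouldAbort` and the worker executable are hypothetical names; only `System.Diagnostics` types from the .NET base class library are used.

```csharp
using System;
using System.Diagnostics;

static class GuardedRun
{
    // Pure decision rule: abort once either budget is exceeded.
    public static bool ShouldAbort(TimeSpan elapsed, long workingSetBytes,
                                   TimeSpan timeout, long maxWorkingSetBytes)
    {
        return elapsed > timeout || workingSetBytes > maxWorkingSetBytes;
    }

    // Starts the worker executable and polls it twice per second.
    // Returns true if the worker finished by itself with exit code 0,
    // false if it failed or had to be killed.
    public static bool Run(string exePath, string arguments,
                           TimeSpan timeout, long maxWorkingSetBytes)
    {
        var startInfo = new ProcessStartInfo(exePath, arguments) { UseShellExecute = false };
        using (Process worker = Process.Start(startInfo))
        {
            Stopwatch clock = Stopwatch.StartNew();
            while (!worker.WaitForExit(500))
            {
                try
                {
                    worker.Refresh(); // re-read the cached memory counters
                    if (ShouldAbort(clock.Elapsed, worker.WorkingSet64, timeout, maxWorkingSetBytes))
                    {
                        worker.Kill();
                        worker.WaitForExit();
                        return false;
                    }
                }
                catch (InvalidOperationException)
                {
                    // The worker exited between the poll and the check;
                    // the loop condition will observe the exit.
                }
            }
            return worker.ExitCode == 0;
        }
    }
}
```

A call could look like `GuardedRun.Run("PdfSearchWorker.exe", "input.pdf storytelling", TimeSpan.FromMinutes(2), 1L << 30)`, where `PdfSearchWorker.exe` is an assumed small console program that wraps the search code above. Because the worker is killed abruptly, it would have to hand its results back through a file or standard output.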




Hi there,

Thanks for your inquiry. I tested the text search scenario with your shared document using Aspose.Pdf for .NET 11.5.0 and was able to reproduce the reported memory consumption issue. For further investigation, I have logged it in our issue tracking system as PDFNEWNET-40635 and linked your request to it. We will keep you updated on the issue status via this thread.

We are sorry for the inconvenience caused.

Best Regards,

Hi,

Any news here?
I have a similar problem: my PDF is 1.5 MB, and I loop through its pages and annotations using simple code like this:

ta = New Text.TextFragmentAbsorber()
ta.Visit(objPage)
For Each tf As Aspose.Pdf.Text.TextFragment In ta.TextFragments
TextToDisplay = TextToDisplay + tf.Text
Next

Memory consumption increases by 300 MB.
For a 1.5 MB file, 300 MB is too much.

What would you recommend to free the memory?
I tried forcing garbage collection, but it did not release the memory.

Thanks,
Oliver
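
A side note on the loop above: building `TextToDisplay` with `+` allocates a new string on every iteration, so the intermediate copies add to the garbage the collector has to deal with. The sketch below accumulates text in a single `StringBuilder` instead. It will not reduce whatever memory the absorber itself holds, and `TextJoiner` is a hypothetical helper name used only for illustration.

```csharp
using System.Collections.Generic;
using System.Text;

static class TextJoiner
{
    // Concatenates fragment texts through one growing buffer instead of
    // allocating a new string for every fragment.
    public static string Join(IEnumerable<string> fragmentTexts)
    {
        var buffer = new StringBuilder();
        foreach (string text in fragmentTexts)
        {
            buffer.Append(text);
        }
        return buffer.ToString();
    }
}
```

In the original loop the same idea amounts to creating one `StringBuilder` before the loop, calling `Append(tf.Text)` for each fragment, and calling `ToString()` once at the end.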

Hi Oliver,


Thanks for your inquiry. I am afraid the reported issue is not resolved yet; it is still pending investigation in the queue.

Furthermore, we would appreciate it if you could share your source PDF document here as well. Issues usually vary from file to file, so we will test your document and provide information accordingly.

We are sorry for the inconvenience.

Best Regards,

Hi Tilal,

Please find attached a small project and a sample file.
The PDF is around 1.5 MB, but it uses more than 160 MB of memory, and the code is extremely simple:

public void MemoryExplosion()
{
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document("MemoryIssue.pdf");
    String TextToDisplay = "";
    foreach (Aspose.Pdf.Page objPage in pdfDocument.Pages)
    {
        foreach (Aspose.Pdf.Annotations.Annotation objAnnotation in objPage.Annotations)
        {
            // Skip annotations that are not links instead of failing on the cast.
            Aspose.Pdf.Annotations.LinkAnnotation objLinkAnnotation =
                objAnnotation as Aspose.Pdf.Annotations.LinkAnnotation;
            if (objLinkAnnotation == null) continue;

            // Restrict the search to the rectangle covered by the link.
            Aspose.Pdf.Rectangle rect = objLinkAnnotation.Rect;
            TextFragmentAbsorber ta = new TextFragmentAbsorber();
            TextSearchOptions textSearchOptions = new TextSearchOptions(rect);

            ta.TextSearchOptions = textSearchOptions;
            // Note: the second assignment overwrites the first.
            ta.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.AdjustSpaceWidth;
            ta.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
            ta.Visit(objPage);

            // Read the fragments after the page has been visited.
            TextFragmentCollection textFragmentCollection = ta.TextFragments;
            foreach (Aspose.Pdf.Text.TextFragment tf in textFragmentCollection)
            {
                TextToDisplay = TextToDisplay + tf.Text;
            }
        }
    }
}

Hi Oliver,


Thanks for sharing the source document and code. I have tested the scenario and noticed that the API consumes almost 300 MB of memory for the shared 1.5 MB file. This is a different case from the issue logged above, as the API successfully extracts the text from your file in about 5-6 seconds. Nevertheless, I have logged an investigation ticket, PDFNET-41259, in our issue tracking system and asked our product team to investigate the memory consumption and suggest a workaround. We will keep you updated on the progress.

We are sorry for the inconvenience.

Best Regards,

Hi Tilal,

Text extraction works as it should, so that is not the problem :).
The issue is in objPage.Accept(ta): when a document has a lot of pages, memory consumption explodes. It looks the same as in the original post.

Basically, looping over all pages would work if Aspose released memory after each page is processed. Unfortunately, I was not able to find any option to do that from my code, so, as you noticed, my 1.5 MB document requires almost 300 MB of memory. As I am processing documents in parallel, you can understand that this can be overkill and will affect all customers :(.
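
Since the documents are processed in parallel, one way to keep peak memory bounded until a fix is available might be to cap how many documents are in flight at once. The following is a minimal sketch of that idea using `SemaphoreSlim`; `ThrottledRunner`, `RunAsync` and the `processOne` callback are hypothetical names, and the per-document work itself is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledRunner
{
    // Runs processOne for every item, with at most maxConcurrent
    // items being processed at the same time.
    public static async Task RunAsync<T>(IEnumerable<T> items, int maxConcurrent,
                                         Action<T> processOne)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            List<Task> tasks = items.Select(async item =>
            {
                await gate.WaitAsync();      // wait for a free slot
                try
                {
                    await Task.Run(() => processOne(item));
                }
                finally
                {
                    gate.Release();          // free the slot, even on failure
                }
            }).ToList();

            await Task.WhenAll(tasks);
        }
    }
}
```

With a limit of 2, for example, a workload where each document peaks at roughly 300 MB would stay near 600 MB instead of growing with the number of queued documents. This only limits the peak; it does not change what a single document costs.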

Hi Oliver,


Thanks for your feedback. We have recorded your concern. Please note that we have already grouped both issues and raised the issue priority. We will notify you as soon as our product team completes the analysis and shares their feedback.

Thanks for your patience and cooperation.

Best Regards,

Hello

Any news on this issue? I have a 662-page document and the memory consumption is exploding…

Just to let you know, I’m using this workaround:


for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    // Copy each page into its own temporary document so that it can be disposed after processing.
    using (Document singleDocument = new Document())
    {
        singleDocument.Pages.Add(pdfDocument.Pages[i]);
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+", new TextSearchOptions(true));
        singleDocument.Pages[1].Accept(textFragmentAbsorber);
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
        foreach (TextFragment textFragment in textFragmentCollection)
        {
            foreach (TextSegment textSegment in textFragment.Segments)
            {
                /* my code */
            }
        }
    }
}

Hi Philippe,


Thanks for your inquiry. As mentioned earlier in this thread, issues vary from file to file; sometimes they are caused by the complexity of the document or the elements inside it. For testing purposes, I ran your code snippet against my sample PDF document (~26 MB, 1718 pages) and was unable to observe any memory leak. We would really appreciate it if you could share a sample document so that we can try to reproduce the issue in our environment and share our findings with you.

We are sorry for the inconvenience.


Best Regards,