I’m trying to search through ~1700 documents, but I seem to be running into a memory issue.
Memory slowly increases to 8.4 GB while going through the documents.
I’m searching using the TextFragmentAbsorber in the following way:
// API params
public class SearchRequest
{
    [Required]
    public string Query { get; set; }
    [Required]
    public IFormFile Content { get; set; }
}
// Endpoint
public IActionResult SearchPdf([FromForm] SearchRequest searchRequest)
{
    var hits = 0;
    using (var ms = new MemoryStream())
    {
        searchRequest.Content.CopyTo(ms);
        using (var document = new Aspose.Pdf.Document(ms))
        {
            var textSearchOptions = new TextSearchOptions(true);
            textSearchOptions.IgnoreResourceFontErrors = true;
            foreach (var page in document.Pages)
            {
                var textFragmentAbsorber = new TextFragmentAbsorber(searchRequest.Query, textSearchOptions);
                page.Accept(textFragmentAbsorber);
                hits += textFragmentAbsorber.TextFragments.Count;
            }
        }
    }
    return Ok(hits);
}
I’m not comfortable deploying this to our production environment as is.
Could you clarify this memory consumption for me? Is this an issue, or is it normal when searching through ~1700 documents?
Memory consumption did fall to around 2.7 GB, but performance took a massive hit. Without the forced garbage collection, searching through all 1700 documents took around 45-60 seconds. With forced garbage collection, the search takes ~49 minutes.
@dfhchaa
The performance hit is expected: it is the effect of adding GC.Collect(). I asked you to add it (especially in each iteration) only as a diagnostic check.
Please show the code you use that works with the library.
@dfhchaa
Thank you.
I have not seen the code that works directly with the PDF documents, so I will note that Aspose.Pdf.Document implements IDisposable and should be wrapped in a using statement.
Sorry for the late reply, I’ve had a couple of days vacation.
As you can see from my original question, I did realize that Document implements IDisposable, as shown in the code sample below.
using (var document = new Aspose.Pdf.Document(ms))
{
    var textSearchOptions = new TextSearchOptions(true);
    textSearchOptions.IgnoreResourceFontErrors = true;
    // Abbreviated...
}
This is the whole reason for my confusion. I’m disposing the document, but it doesn’t seem to release memory.
I need to reproduce this in my environment, and if I don’t see a clear cause, I will file a task for the development team.
Could you package this in a form that I can reproduce myself? Perhaps try working with a thousand copies of a single file that you could attach?
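One possible shape for such a reproduction is sketched below. It is only an illustration, not code from either poster: the MemoryRepro class, its Run method and the search delegate are hypothetical names. The harness feeds the same PDF bytes through a fresh MemoryStream on every iteration (mirroring the endpoint above), logs the process working set every 100 files, and forces a full collection only once before and once after the loop rather than in each iteration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical harness: repeats a search over the same PDF bytes and
// reports how much managed memory is still retained afterwards.
public static class MemoryRepro
{
    public static long Run(byte[] samplePdf, int iterations, Func<Stream, int> search, TextWriter log)
    {
        // Baseline after a single full collection (not one per iteration).
        long before = GC.GetTotalMemory(forceFullCollection: true);
        long totalHits = 0;

        for (int i = 1; i <= iterations; i++)
        {
            // A fresh stream per "document", as in the endpoint.
            using (var ms = new MemoryStream(samplePdf, writable: false))
            {
                totalHits += search(ms);
            }

            if (i % 100 == 0)
            {
                // Working set includes unmanaged memory, unlike GC.GetTotalMemory.
                using (var process = Process.GetCurrentProcess())
                {
                    log.WriteLine($"{i} files, working set {process.WorkingSet64 / (1024 * 1024)} MB");
                }
            }
        }

        long after = GC.GetTotalMemory(forceFullCollection: true);
        log.WriteLine($"total hits {totalHits}, retained managed bytes {after - before}");
        return after - before;
    }
}
```

It could be called as MemoryRepro.Run(File.ReadAllBytes("sample.pdf"), 1000, stream => SearchStream(stream, "query"), Console.Out), where sample.pdf is any attached test file and SearchStream is an assumed wrapper around the Document and TextFragmentAbsorber loop from the original post. If the working set keeps climbing while the retained managed bytes stay small, that would point toward unmanaged allocations or collector behavior rather than undisposed managed objects.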