Memory issue when searching through 1700 PDF documents

Hello,

I’m trying to search through ~1700 documents, but I seem to be running into a memory issue.
Memory slowly increases to 8.4 GB whilst going through the documents.

VS diagnostics.png (9.2 KB)

I’m searching using the TextFragmentAbsorber in the following way:

// API params
public class SearchRequest
{
	[Required]
	public string Query { get; set; }

	[Required]
	public IFormFile Content { get; set; }
}

// Endpoint
public IActionResult SearchPdf([FromForm] SearchRequest searchRequest)
{
	var hits = 0;

	using (var ms = new MemoryStream())
	{
		searchRequest.Content.CopyTo(ms);
		using (var document = new Aspose.Pdf.Document(ms))
		{
			var textSearchOptions = new TextSearchOptions(true);
			textSearchOptions.IgnoreResourceFontErrors = true;

			foreach (var page in document.Pages)
			{
				var textFragmentAbsorber = new TextFragmentAbsorber(searchRequest.Query, textSearchOptions);
				page.Accept(textFragmentAbsorber);

				hits += textFragmentAbsorber.TextFragments.Count;
			}
		}
	}

	return Ok(hits);
}

I’m not comfortable deploying this to our production environment as is.

Could you explain this memory consumption for me? Is this an issue, or is it normal when searching through ~1700 documents?

@dfhchaa
Since I don’t have these 1700 documents, I can’t test this myself. Could you add a forced call to the garbage collector

GC.Collect();

at the beginning of each iteration and see how the memory consumption changes?

Hello Sergei,

Memory consumption did fall to around 2.7 GB, but performance took a massive hit. Without the forced garbage collection, searching through all 1700 documents took around 45-60 seconds. With forced garbage collection, the search takes ~49 minutes.

@dfhchaa
The performance hit is expected; it is a direct effect of adding GC.Collect(). I asked you to add it (particularly in every iteration) purely as a diagnostic check.
Please show the code you use to call the library.
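
For a lighter-weight version of that check, here is a rough sketch that forces a collection only once every N documents and logs the managed heap next to the process working set. `OnDocumentProcessed` and `CollectEvery` are hypothetical names, and the interval of 100 is an arbitrary assumption. The large object heap compaction is included only because the document buffers in this scenario are probably larger than 85,000 bytes and would therefore be allocated there; none of this is confirmed to be the cause of the growth.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;
using System.Threading;

// Hypothetical diagnostic helper (not part of Aspose.PDF).
// Call it once per searched document instead of calling GC.Collect() for every page.
const int CollectEvery = 100; // assumed interval; tune it for your workload
int processed = 0;

// Returns true when this call forced a collection.
bool OnDocumentProcessed()
{
    int count = Interlocked.Increment(ref processed);
    if (count % CollectEvery != 0)
        return false;

    long beforeMb = GC.GetTotalMemory(false) / (1024 * 1024);

    // Request a one-off compaction of the large object heap during the next blocking collection.
    GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
    GC.Collect();
    GC.WaitForPendingFinalizers();

    long afterMb = GC.GetTotalMemory(false) / (1024 * 1024);
    using var process = Process.GetCurrentProcess();
    long workingSetMb = process.WorkingSet64 / (1024 * 1024);

    Console.WriteLine(
        $"docs={count} serverGC={GCSettings.IsServerGC} " +
        $"managed={beforeMb} MB -> {afterMb} MB workingSet={workingSetMb} MB");
    return true;
}

// Usage: simulate 250 documents with a throwaway buffer standing in for each PDF.
int forced = 0;
for (int i = 0; i < 250; i++)
{
    _ = new byte[200_000]; // stand-in for one document's content
    if (OnDocumentProcessed())
        forced++;
}
Console.WriteLine($"forced collections: {forced}");
```

In the API, the call would go at the end of SearchPdf, after the using blocks have closed. If the managed figure drops while the working set stays high, that would point at the runtime holding on to memory rather than at objects being kept alive.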

Hello Sergei,

The only difference I made was to include GC.Collect() at the start of the loop, within the controller method (public IActionResult SearchPdf)

foreach (var page in document.Pages)
{
    GC.Collect();
    
    var textFragmentAbsorber = new TextFragmentAbsorber(searchRequest.Query, textSearchOptions);
    page.Accept(textFragmentAbsorber);

    hits += textFragmentAbsorber.TextFragments.Count;
}

Please let me know if you require further information :)

@dfhchaa
I would like to see more complete code: how the documents are opened, how these 1700 documents are supplied, and how they are processed.

Ah, gotcha.

Below is a snippet of the code reading document content, and sending these to the API using an HTTP client.

// AsposeClient.cs

public class AsposeClient : IPdfClient
{
	private readonly HttpClient httpClient;

	public AsposeClient(
		HttpClient httpClient,
		IOptions<AsposeConfiguration> config)
	{
		this.httpClient = httpClient;
		this.httpClient.BaseAddress = new Uri(config.Value.BaseUrl);
	}

	public async Task<int> SearchPdfAsync(string query, byte[] documentContent, string fileName)
	{
		var request = new HttpRequestMessage(HttpMethod.Post, 
			"api/document/pdf/search");

		var requestContent = new MultipartFormDataContent();

		var byteContent = new ByteArrayContent(documentContent);

		requestContent.Add(byteContent, "Content", fileName);
		requestContent.Add(new StringContent(query), "Query");

		request.Content = requestContent;

		var response = await httpClient.SendAsync(request);
		var responseContent = await response.Content.ReadAsStringAsync();
		if(!response.IsSuccessStatusCode)
		{
			throw new Exception(responseContent);
		}

		return int.Parse(responseContent);
	}
}

// DocumentService.cs
public class DocumentService
{
	private readonly IPdfClient pdfClient;
	
	public DocumentService(IPdfClient pdfClient)
	{
		this.pdfClient = pdfClient;
	}

	// Search through documents
	public async Task<bool> SearchDocuments(IEnumerable<Document> documents, string query, CancellationToken cancellationToken)
	{
		foreach (var document in documents)
		{
			if (cancellationToken.IsCancellationRequested)
				break;

			// Ignore if not PDF
			if (!document.Name.EndsWith(".pdf", StringComparison.InvariantCultureIgnoreCase))
				continue;

			var documentContent = await GetDocumentContentAsync(document.Filename);
			if (documentContent == null || documentContent.Length == 0)
				continue;

			try
			{
				var hits = await pdfClient.SearchPdfAsync(query, documentContent, document.Filename);

				if (hits > 0)
					return true;
			}
			catch
			{
				// Some documents contain errors and cannot be searched correctly. Ignore these.
				continue;
			}
		}

		return false;
	}
	
	// Get document content
	public async Task<byte[]> GetDocumentContentAsync(string fileName)
	{
		var filePath = Path.Combine(config.FilePath, fileName);
		var documentContent = await File.ReadAllBytesAsync(filePath);

		return documentContent;
	}
}
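
One thing that may be worth trying on the client side is streaming each file into the request instead of loading it into a byte[] first, so that large documents are not fully buffered in the caller's memory. This is a minimal sketch under that assumption; `BuildSearchContent` is a hypothetical helper, and the form field names "Content" and "Query" are taken from the code above. It would not change how the API itself allocates memory.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net.Http;

// Hypothetical helper: builds the same multipart body as SearchPdfAsync,
// but reads the file from disk as a stream rather than from a byte[].
MultipartFormDataContent BuildSearchContent(string query, string filePath)
{
    var fileStream = new FileStream(
        filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 81920, useAsync: true);

    var content = new MultipartFormDataContent();
    // Disposing the multipart content disposes the StreamContent, which closes the file.
    content.Add(new StreamContent(fileStream), "Content", Path.GetFileName(filePath));
    content.Add(new StringContent(query), "Query");
    return content;
}

// Usage: build a body for a temporary file and inspect it without sending anything.
string tempPath = Path.GetTempFileName();
File.WriteAllBytes(tempPath, new byte[1024]);

int partCount;
using (MultipartFormDataContent body = BuildSearchContent("invoice", tempPath))
{
    // MultipartContent enumerates its parts: the file and the query string.
    partCount = body.Count();
}

File.Delete(tempPath);
Console.WriteLine($"parts in request body: {partCount}");
```

When sending for real, wrapping both the HttpRequestMessage and the HttpResponseMessage in using statements releases the file handle and the response buffers as soon as each call completes.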

Please let me know if you need anything else :)

@dfhchaa
Thank you.
I did not see how the PDF documents themselves are handled, so I will note that Aspose.Pdf.Document implements IDisposable and should be wrapped in a using statement:

using var doc = new Document();

or

using (var doc = new Document())
{
}

Could this be missing from your code?

Hello Sergei,

Sorry for the late reply, I’ve had a couple of days vacation.

As you can see from my original question, I did realize that Document implements IDisposable, as shown in the code sample below.

using (var document = new Aspose.Pdf.Document(ms))
{
    var textSearchOptions = new TextSearchOptions(true);
    textSearchOptions.IgnoreResourceFontErrors = true;

    // Abbreviated...
}

This is the whole reason for my confusion. I’m disposing of the document, but that doesn’t seem to release the memory.
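
Part of the explanation may simply be that Dispose and garbage collection are separate steps: disposing lets an object release what it holds, but the managed memory stays allocated until the runtime decides to collect. The sketch below illustrates this with a plain MemoryStream rather than Aspose.Pdf.Document; `AllocateAndDispose` is a hypothetical helper and the 8 MB size is arbitrary. Whether this accounts for the growth seen here is an assumption, not something verified against the library.

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;

// Allocates a large buffer through a MemoryStream, disposes the stream,
// and returns a weak reference that tracks whether the buffer is still in memory.
[MethodImpl(MethodImplOptions.NoInlining)]
WeakReference AllocateAndDispose()
{
    var ms = new MemoryStream();
    ms.SetLength(8 * 1024 * 1024);                    // forces an 8 MB internal buffer
    var tracker = new WeakReference(ms.GetBuffer());  // does not keep the buffer alive
    ms.Dispose();                                     // closes the stream, frees no managed memory
    return tracker;
}

WeakReference buffer = AllocateAndDispose();
bool aliveAfterDispose = buffer.IsAlive;

// Only a collection actually reclaims the unreachable buffer.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
bool aliveAfterCollect = buffer.IsAlive;

Console.WriteLine($"buffer alive after Dispose: {aliveAfterDispose}");
Console.WriteLine($"buffer alive after GC.Collect: {aliveAfterCollect}");
```

If the same pattern holds for the documents, a process that is under no memory pressure could plausibly grow to several GB before the runtime collects, which would also fit the drop to 2.7 GB once collections were forced.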

@dfhchaa

Yes, that’s right, sorry, I missed it.

I need to reproduce this in my environment and, if I don’t find a clear cause, I will file a task for the development team.
Can you package this in a form that lets me reproduce it myself? Perhaps the problem also shows up with a thousand copies of a single file that you could attach?
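
A repro along those lines could be as small as the following sketch, which runs the same bytes through a search callback many times and reports memory at intervals. `RunRepro` and its parameters are hypothetical names; the callback shown here is a stand-in that only checks the stream length, and the real one would contain the TextFragmentAbsorber loop from the original post.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical harness: processes `copies` in-memory copies of one document
// and logs managed heap and working set every `reportEvery` iterations.
int RunRepro(byte[] pdfBytes, int copies, Func<Stream, int> search, int reportEvery)
{
    int totalHits = 0;
    for (int i = 1; i <= copies; i++)
    {
        // A fresh read-only stream per iteration, mirroring one upload per document.
        using (var ms = new MemoryStream(pdfBytes, writable: false))
        {
            totalHits += search(ms);
        }

        if (i % reportEvery == 0)
        {
            using var process = Process.GetCurrentProcess();
            long managedMb = GC.GetTotalMemory(false) / (1024 * 1024);
            long workingSetMb = process.WorkingSet64 / (1024 * 1024);
            Console.WriteLine($"iteration={i} managed={managedMb} MB workingSet={workingSetMb} MB");
        }
    }
    return totalHits;
}

// Usage with placeholder data; for a real run, load the attached file with
// File.ReadAllBytes and replace the callback with the Aspose search.
byte[] sample = new byte[256 * 1024];
int total = RunRepro(sample, copies: 1000, search: s => s.Length > 0 ? 1 : 0, reportEvery: 250);
Console.WriteLine($"total hits: {total}");
```

Running it as a console application would also show whether the growth depends on the ASP.NET Core host at all, since the harness exercises only the search step.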

I’ll definitely give it a go.

I’ll post a code sample once I’m done, and attach one of the files in question.

@dfhchaa
Yes, OK.