Memory issue when searching through 1700 PDF documents

dfhchaa · August 31, 2023, 12:50pm

Hello,

I’m trying to search through ~1700 documents, but I seem to be running into a memory issue.
Memory slowly climbs to 8.4 GB while the search works through the documents.

VS diagnostics.png (9.2 KB)

I’m searching using the TextFragmentAbsorber in the following way:

// API params
public class SearchRequest
{
	[Required]
	public string Query { get; set; }

	[Required]
	public IFormFile Content { get; set; }
}

// Endpoint
public IActionResult SearchPdf([FromForm] SearchRequest searchRequest)
{
	var hits = 0;

	using (var ms = new MemoryStream())
	{
		searchRequest.Content.CopyTo(ms);
		using (var document = new Aspose.Pdf.Document(ms))
		{
			var textSearchOptions = new TextSearchOptions(true);
			textSearchOptions.IgnoreResourceFontErrors = true;

			foreach (var page in document.Pages)
			{
				var textFragmentAbsorber = new TextFragmentAbsorber(searchRequest.Query, textSearchOptions);
				page.Accept(textFragmentAbsorber);

				hits += textFragmentAbsorber.TextFragments.Count;
			}
		}
	}

	return Ok(hits);
}

I’m not comfortable deploying this to our production environment as is.

Could you clarify this memory consumption for me? Is this an issue, or is it normal when searching through ~1700 documents?

sergei.shibanov · August 31, 2023, 3:42pm

@dfhchaa
Since I don’t have these 1700 documents and can’t try this myself, could you add a forced call to the garbage collector

GC.Collect();

at the beginning of each iteration and see how the memory consumption changes?

dfhchaa · September 1, 2023, 7:13am

Hello Sergei,

Memory consumption did drop to around 2.7 GB, but performance took a massive hit. Without the forced garbage collection, searching through all 1700 documents took around 45-60 seconds. With forced garbage collection, the search takes ~49 minutes.

sergei.shibanov · September 1, 2023, 5:20pm

@dfhchaa
The performance hit is expected; it is the effect of adding GC.Collect(). I asked you to add it (especially in every iteration) only as a diagnostic check.
Please show the code you use that works with the library.

dfhchaa · September 4, 2023, 6:25am

Hello Sergei,

The only difference I made was to include GC.Collect() at the start of the loop, within the controller method (public IActionResult SearchPdf)

foreach (var page in document.Pages)
{
    GC.Collect();
    
    var textFragmentAbsorber = new TextFragmentAbsorber(searchRequest.Query, textSearchOptions);
    page.Accept(textFragmentAbsorber);

    hits += textFragmentAbsorber.TextFragments.Count;
}

Please let me know if you require further information

sergei.shibanov · September 4, 2023, 1:24pm

@dfhchaa
I would like to see more complete code: how the documents are opened, how these 1700 documents are read, and how they are processed.

dfhchaa · September 5, 2023, 6:19am

Ah, gotcha.

Below is a snippet of the code reading document content, and sending these to the API using an HTTP client.

// AsposeClient.cs

public class AsposeClient : IPdfClient
{
	private readonly HttpClient httpClient;

	public AsposeClient(
		HttpClient httpClient,
		IOptions<AsposeConfiguration> config)
	{
		this.httpClient = httpClient;
		this.httpClient.BaseAddress = new Uri(config.Value.BaseUrl);
	}

	public async Task<int> SearchPdfAsync(string query, byte[] documentContent, string fileName)
	{
		var request = new HttpRequestMessage(HttpMethod.Post, 
			"api/document/pdf/search");

		var requestContent = new MultipartFormDataContent();

		var byteContent = new ByteArrayContent(documentContent);

		requestContent.Add(byteContent, "Content", fileName);
		requestContent.Add(new StringContent(query), "Query");

		request.Content = requestContent;

		var response = await httpClient.SendAsync(request);
		var responseContent = await response.Content.ReadAsStringAsync();
		if(!response.IsSuccessStatusCode)
		{
			throw new Exception(responseContent);
		}

		return int.Parse(responseContent);
	}
}

// DocumentService.cs
public class DocumentService
{
	private readonly IPdfClient pdfClient;
	
	public DocumentService(IPdfClient pdfClient)
	{
		this.pdfClient = pdfClient;
	}

	// Search through documents
	public async Task<bool> SearchDocuments(IEnumerable<Document> documents, string query, CancellationToken cancellationToken)
	{
		foreach (var document in documents)
		{
			if (cancellationToken.IsCancellationRequested)
				break;

			// Ignore if not PDF
			if (!document.Name.EndsWith(".pdf", StringComparison.InvariantCultureIgnoreCase))
				continue;

			var documentContent = await GetDocumentContentAsync(document.Filename);
			if (documentContent == null || documentContent.Length == 0)
				continue;

			try
			{
				var hits = await pdfClient.SearchPdfAsync(query, documentContent, document.Filename);

				if (hits > 0)
					return true;
			}
			catch
			{
				// Some documents contain errors and cannot be searched correctly. Ignore these.
				continue;
			}
		}

		return false;
	}
	
	// Get document content
	public async Task<byte[]> GetDocumentContentAsync(string fileName)
	{
		var filePath = Path.Combine(config.FilePath, fileName);
		var documentContent = await File.ReadAllBytesAsync(filePath);

		return documentContent;
	}
}

Please let me know if you need anything else

sergei.shibanov · September 5, 2023, 3:52pm

@dfhchaa
Thank you.
I haven’t seen the code that works directly with the PDF documents, so I’ll note that Aspose.Pdf.Document implements IDisposable and should be wrapped in a using statement.

using var doc = new Document();

or

using (var doc = new Document())
{
}

Could this be missing from your code?

dfhchaa · September 11, 2023, 11:05am

Hello Sergei,

Sorry for the late reply, I’ve had a couple of days vacation.

As you can see from my original question, I did realize that Document implements IDisposable, as shown in the code sample below.

using (var document = new Aspose.Pdf.Document(ms))
{
    var textSearchOptions = new TextSearchOptions(true);
    textSearchOptions.IgnoreResourceFontErrors = true;

    // Abbreviation...
}

This is the whole reason for my confusion. I’m disposing the document, but it doesn’t seem to release memory.
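
One general .NET point may be relevant here: Dispose() releases whatever the object chooses to release, but managed memory only goes back to the runtime when the garbage collector next runs. The sketch below illustrates that with a hypothetical FakeDocument stand-in (it is not an Aspose type, and it says nothing about what Aspose.Pdf.Document does internally). It compares the managed heap size right after disposing with the size after a forced collection.

```csharp
using System;

// Hypothetical stand-in for a disposable document type; NOT part of Aspose.PDF.
public sealed class FakeDocument : IDisposable
{
    private byte[] buffer;

    public bool IsDisposed { get; private set; }

    public FakeDocument(int sizeInBytes)
    {
        buffer = new byte[sizeInBytes];
    }

    public void Dispose()
    {
        // Dropping the reference makes the array collectable,
        // but it is not reclaimed until the GC actually runs.
        buffer = Array.Empty<byte>();
        IsDisposed = true;
    }
}

public static class DisposeDemo
{
    // Creates and disposes `count` fake documents, then reports the managed
    // heap size before and after a forced full collection.
    public static (long AfterDispose, long AfterCollect) Measure(int count, int sizeInBytes)
    {
        for (var i = 0; i < count; i++)
        {
            using (var doc = new FakeDocument(sizeInBytes))
            {
                // Work with the document would go here.
            }
        }

        long afterDispose = GC.GetTotalMemory(forceFullCollection: false);
        long afterCollect = GC.GetTotalMemory(forceFullCollection: true);
        return (afterDispose, afterCollect);
    }
}
```

Calling DisposeDemo.Measure(100, 1_000_000) and printing both values shows how much of the heap was only waiting for a collection. The exact numbers depend on the runtime and GC mode, so treat them as indicative only.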

sergei.shibanov · September 11, 2023, 3:28pm

@dfhchaa

Yes, that’s right, sorry, I missed it.

I need to reproduce this in my environment, and if I don’t see a clear cause, I will log a task for the development team.
Could you package this in a way that lets me reproduce it myself? For example, try working with a thousand copies of a single file that you can attach.
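
A repro along those lines could be a small console loop that runs the search once per copy and samples the managed heap every so often, instead of calling GC.Collect() on every page. This is only a sketch: MemoryProbe and its parameters are hypothetical names, and the delegate passed in is where the Document / TextFragmentAbsorber code from the first post would go.

```csharp
using System;

// Hypothetical helper for a repro; not part of Aspose.PDF.
public static class MemoryProbe
{
    // Runs `work` once per item and samples the managed heap every
    // `sampleEvery` items. GC.GetTotalMemory(true) forces a collection
    // first, so each sample reflects memory that is still reachable.
    public static long[] Run(int itemCount, int sampleEvery, Action<int> work)
    {
        if (sampleEvery <= 0)
            throw new ArgumentOutOfRangeException(nameof(sampleEvery));

        var samples = new long[itemCount / sampleEvery];

        for (var i = 0; i < itemCount; i++)
        {
            work(i);

            if ((i + 1) % sampleEvery == 0)
            {
                samples[(i + 1) / sampleEvery - 1] =
                    GC.GetTotalMemory(forceFullCollection: true);
            }
        }

        return samples;
    }
}
```

For example, MemoryProbe.Run(1000, 100, i => SearchOneCopy(i)) would return ten samples, where SearchOneCopy is a placeholder for opening one copy of the attached file and running the search. If the samples keep growing even though each one follows a forced collection, something is still holding references. If they stay flat, the growth seen in the diagnostics is more likely memory that had simply not been collected yet.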

dfhchaa · September 12, 2023, 6:17am

I’ll definitely give it a go.

I’ll post a code sample once I’m done, and attach one of the files in question.

sergei.shibanov · September 12, 2023, 8:11am

@dfhchaa
Yes OK