Text extraction from PDF and memory utilization

mindaugasa · January 31, 2024, 10:16am

We are using Aspose.PDF to extract text from PDFs.
With large, complex PDFs, we sometimes see very high memory usage.

Can the following property and method help reduce memory usage when extracting text from PDFs?
EnableObjectUnload and OptimizeResources()

Or maybe you can suggest other code improvements?

Sample code:

 var pdf = new Aspose.Pdf.Document(pMemStream);
 try
 {
     var textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
     pdf.Pages.Accept(textAbsorber);
     return textAbsorber.Text;
 }
 finally
 {
     pdf.Dispose();
 }

asad.ali · January 31, 2024, 7:59pm

@mindaugasa

These members are not intended for performance improvements. They are used to optimize PDF file size and for other purposes.

However, you can improve performance by performing text absorption at the page level. Instead of extracting text from the entire PDF at once, you can extract it page by page, as shown here, to reduce memory consumption:

 foreach (var page in pdf.Pages)
 {
     page.Accept(textAbsorber);
 }
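
For context, here is a fuller sketch of one variation on the page-level idea. It creates a fresh TextAbsorber for each page and writes that page's text to a TextWriter, so the complete document text is never accumulated in a single string. The class name PdfTextExtractor, the method ExtractText, and the output parameter are hypothetical names chosen for illustration; only Document, Pages, Page.Accept, TextAbsorber, and TextAbsorber.Text come from the code in this thread. Whether this lowers peak memory is an assumption that depends on the document and the library version, so it is worth measuring.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper class, not part of Aspose.PDF.
public static class PdfTextExtractor
{
    // Writes the text of each page to `output` as soon as it is extracted.
    public static void ExtractText(Stream pdfStream, TextWriter output)
    {
        // `using` disposes the document even if extraction throws,
        // replacing the explicit try/finally with pdf.Dispose().
        using (var pdf = new Document(pdfStream))
        {
            foreach (Page page in pdf.Pages)
            {
                // Assumption: a per-page absorber holds only that page's text,
                // so earlier pages' text can be garbage collected.
                var absorber = new TextAbsorber();
                page.Accept(absorber);
                output.Write(absorber.Text);
            }
        }
    }
}
```

A caller could pass a StreamWriter to send the text straight to a file, or a StringWriter if the whole text is still needed in memory.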

In case you still notice any issues, please share your sample PDF with us so that we can test the scenario in our environment and address it accordingly.

mindaugasa · February 1, 2024, 7:38am

Extracting page by page doesn’t make any difference.
But using OptimizeResources() does. And, stranger still, the same set of documents is processed more quickly …

asad.ali · February 1, 2024, 2:33pm

@mindaugasa

It depends on how you used this method. If you called it before text extraction, it may have optimized the document for faster loading as well as reduced its size. However, its intended purpose is to optimize large PDFs. Please let us know if you have further concerns or run into any other issues.
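
As an illustration of the ordering described above, a minimal sketch that calls OptimizeResources() on the loaded document before extracting text might look like the following. The class name OptimizedExtraction and the method ExtractText are hypothetical; the Aspose.PDF calls are the ones already mentioned in this thread. The document is not saved here, and any memory or speed benefit is what was reported above for one set of documents, not a guaranteed effect.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper class, not part of Aspose.PDF.
public static class OptimizedExtraction
{
    public static string ExtractText(Stream pdfStream)
    {
        using (var pdf = new Document(pdfStream))
        {
            // Assumption: optimizing the in-memory document first
            // reduces the work done during text extraction.
            pdf.OptimizeResources();

            var absorber = new TextAbsorber();
            pdf.Pages.Accept(absorber);
            return absorber.Text;
        }
    }
}
```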

mindaugasa · February 1, 2024, 2:46pm

OK, thank you for the information.