Problems with text extraction on vector-based PDFs


#1

Hi,

Aspose.PDF (19.7 for .NET, and earlier) has problems when using TextAbsorber on some vector-based PDFs. I am seeing serious memory leaks: running it on a 50 MB PDF uses 24 GB of RAM before failing. It is the same problem reported 4 years ago here: Out of Memory Exceptions

The vector PDFs can be very complex, so I understand extraction might require a lot of memory, but how can we stop it from using ALL the memory and causing out-of-memory failures? E.g.:

  1. Can we detect if a page has vector images? This was requested a year ago at Check if PDF page have vector images - is there any update on that?

  2. Can TextAbsorber support a timeout, interruption, or memory limit, so that it can abort the text extraction when the timeout/limit is reached?

  3. Can there be an option to skip / ignore vector images when using TextAbsorber?

  4. Is there an alternative to TextAbsorber for extracting all text from a PDF, which might not have the same problems?

Any information would be appreciated, thank you!
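To illustrate point 2: since TextAbsorber in 19.7 exposes no cancellation hook, the closest workaround I can think of is a watchdog that abandons the extraction after a deadline. This is only a sketch with a placeholder standing in for the actual `page.Accept(absorber)` call; note it bounds wall-clock time, not memory, because the abandoned worker task keeps running in the background.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ExtractionTimeout
{
    // Run a non-cancellable operation with a deadline. If the deadline passes,
    // we stop waiting and move on; the worker keeps running in the background,
    // so this limits wall-clock time per page but not peak memory use.
    static bool TryRunWithTimeout(Action work, TimeSpan timeout, out Exception error)
    {
        error = null;
        var task = Task.Run(work);
        if (!task.Wait(timeout))
            return false; // timed out; extraction abandoned for this page
        error = task.Exception?.InnerException;
        return error == null;
    }

    static void Main()
    {
        // Placeholder for page.Accept(absorber); here we just sleep briefly.
        bool ok = TryRunWithTimeout(() => Thread.Sleep(100),
                                    TimeSpan.FromSeconds(5), out _);
        Console.WriteLine(ok ? "completed" : "timed out");
    }
}
```

This does not solve the memory problem, but it would at least let a batch job skip pathological pages instead of hanging for minutes on each one.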


#2

@ast3

Thank you for contacting support.

Would you please share a sample PDF document via Google Drive, Dropbox, etc. along with a narrowed-down code snippet so that we may investigate the scenario. Regarding PDFNET-36137, we are afraid it has not been resolved yet.

Regarding a timeout, a feature request to support InterruptMonitor has already been logged as PDFNET-44380, which will tentatively be investigated around February 2020.

For further investigation, you may share the download link for the requested data privately with us by clicking on my username and then the message icon.


#3

Hi Farhan, thanks for your reply.

Please find attached a single extracted page that gives an example of the vector-based content that causes problems: input.pdf (871.9 KB)

Extracting text using the following basic method with v19.7 takes around 30 seconds (much slower than for documents with no vector graphics):

var pdf = new Aspose.Pdf.Document("input.pdf");
var page = pdf.Pages[1]; // the single extracted page (Pages is 1-based)
var absorber = new Aspose.Pdf.Text.TextAbsorber();
absorber.ExtractionOptions.FormattingMode = Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.MemorySaving;
absorber.Visit(page);

That might be OK for a single page; however, sometimes a document has 50+ vector pages, which means it takes over half an hour for a single PDF.

The PDF already has a text layer, so why is it so slow to read the text? Is there any way to make it ignore the vector parts and only extract the text? Or is there an alternative method that would speed up text extraction?
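On the alternative-method question, one thing that may be worth trying, purely as a hedged suggestion, is the older Facades API, which takes a different code path than TextAbsorber. I have not verified whether it avoids the vector-graphics slowdown on this file:

```csharp
using Aspose.Pdf.Facades;

class FacadesExtraction
{
    static void Main()
    {
        // Sketch only: extract all text via the legacy Facades extractor.
        // "input.pdf"/"output.txt" are placeholder file names.
        var extractor = new PdfExtractor();
        extractor.BindPdf("input.pdf");
        extractor.ExtractText();         // extract text from all pages
        extractor.GetText("output.txt"); // write the extracted text to a file
    }
}
```

If the slowdown is in the page content parsing itself rather than in TextAbsorber, this may perform no better, but it would be a quick experiment.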

Thanks!


#4

@ast3

Would you please share a similar document with more pages that takes several minutes to process, as that will help us investigate the scenario better.


#5

Hi Farhan,

A link to a sample PDF file (via Dropbox) is in the attached zip: Text extraction issue.zip (1.3 KB)

The code to demonstrate the issue (against Aspose.PDF for .NET 19.7) is this:

Aspose.Pdf.Document pdf = new Aspose.Pdf.Document("input.pdf");
System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();

foreach (Aspose.Pdf.Page page in pdf.Pages)
{
    sw.Restart();

    Aspose.Pdf.Text.TextAbsorber absorber = new Aspose.Pdf.Text.TextAbsorber();
    absorber.ExtractionOptions.FormattingMode = Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.MemorySaving;

    page.Accept(absorber);

    Console.WriteLine("Page " + page.Number + ": " + sw.ElapsedMilliseconds + "ms");
    page.Dispose();
}

On my computer (Windows 10, Intel i5-5200U @2.2GHz, 16GB RAM, no other workload) this takes 86 minutes to process the sample PDF file. Some pages are relatively fast, but others take much longer, e.g. page 67 took 806 seconds (13 minutes) to process. It also uses all available RAM.

I understand it is a large and very complex PDF. But is there some way to detect that a page might take a very long time, e.g. detect/ignore the vector parts? Or to implement a timeout?

Thank you


#6

@ast3

Thank you for sharing requested data.

We have been able to reproduce it in our environment. A ticket with ID PDFNET-46696 has been logged to address your concerns. We will let you know as soon as a further update is available in this regard.


#7

There is clearly a memory leak in the Document object, because disposing and recreating the Document object for each page (per the code at Re: How to limit memory usage when extracting text from large PDFs?) releases the memory, though this is very inefficient for multi-page PDFs.
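For reference, the per-page workaround from that thread looks roughly like this (a sketch; the file name is a placeholder). It keeps memory bounded at the cost of re-parsing the whole file for every page:

```csharp
using System;

class PerPageWorkaround
{
    static void Main()
    {
        // First pass: just count the pages, then discard the Document.
        int pageCount;
        using (var probe = new Aspose.Pdf.Document("input.pdf"))
            pageCount = probe.Pages.Count;

        // Open a fresh Document per page so its internal caches are released
        // on Dispose. Very inefficient, but memory stays bounded.
        for (int i = 1; i <= pageCount; i++) // Pages is 1-based
        {
            using (var pdf = new Aspose.Pdf.Document("input.pdf"))
            {
                var absorber = new Aspose.Pdf.Text.TextAbsorber();
                absorber.ExtractionOptions.FormattingMode =
                    Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.MemorySaving;
                pdf.Pages[i].Accept(absorber);
                Console.WriteLine("Page " + i + ": " + absorber.Text.Length + " chars");
                pdf.FreeMemory(); // release internal caches before Dispose
            }
        }
    }
}
```

Re-reading the same 50 MB file dozens of times is clearly the wrong fix, which is why a proper cleanup inside Document would be far preferable.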

The fact that the memory is freed by a simple Dispose of the Document object indicates the problem is managed objects not being released within that class, which should be straightforward to fix (perhaps even via the Document.FreeMemory() method).

Can you advise when this will be fixed? If not soon, I will seriously need to look at other products, because this causes big problems for us.

Thank you!


#8

@ast3

Thank you for following up.

We have recorded your comments under the same ticket and will share our findings as soon as the ticket is investigated. For now, we are afraid an ETA for the resolution is not available yet. However, the request has been recorded and we will update you once any information is available.