Very poor performance and high memory usage with textFragmentAbsorber

nickax · March 1, 2018, 12:32pm

Hi There

We have purchased a developer OEM license for this product, but I am unable to use it in any production scenario due to performance and stability issues with text extraction.

i.e.
page.Accept(textFragmentAbsorber);
or
textFragmentAbsorber.Visit(page);

Files that are processed successfully by both Adobe Reader and PDFTron/PDFnet cause out of memory exceptions with Aspose (see attachment).

image.png (64.9 KB)

Files that do process are extremely slow (nearly 30 times slower than in the PDFnet evaluation edition).

Before I put together steps to reproduce and sample PDFs, are you already aware of, and working on, these issues?

Thanks

Farhan.Raza · March 1, 2018, 6:17pm

@nickax

Thank you for contacting support.

Please clarify whether the issue occurs with every file you are working with, or whether the exception is thrown only for specific PDF files. Please share a narrowed-down code snippet along with the source PDF files so that we can investigate further and help you out.

Please also share a .zip project file along with all necessary resources so that we can compare and explain the performance issues you are noticing with our API.

nickax · March 2, 2018, 12:13pm

Hi Farhan

Thanks for your attention

I have prepared a sample project, which I will share by email. This is the output from one working file.

There are 4 other files in the project, exhibiting various issues.

image.png (10.0 KB)

The files and failures are as follows:

9781447958178 - Does work: 481 pages process in circa 15 seconds. This is just about passable, but 30 times slower than our second-choice library.

9781447969662 - fails with unhandled exception of type " "

Apart from the fact that the [] was unhandled, I accept this is probably an invalid PDF.

9781139882019 - fails on page 3 with a Null Reference Exception, although both the page object and the textFragmentAbsorber are valid

9781292249117 - crushingly slow (around 2 seconds per page), before crashing on page 19 with an out of memory exception

9783642328237 - starts very slow and gets slower; after several minutes it appeared to have locked up completely (this is a very large PDF)

These are 5 random samples from tens of thousands of PDFs I need to be able to index.

Farhan.Raza · March 2, 2018, 6:48pm

@nickax

Please share all necessary files and then acknowledge here in this thread so that we may proceed to help you out.

nickax · March 5, 2018, 9:26am

Hi Farhan

I have resent the link to download the project to the address above

thanks

Nick

Farhan.Raza · March 5, 2018, 7:50pm

@nickax

Thank you for sharing the requested data.

I have worked with the data you shared and have been able to reproduce the issues below. The following tickets have been logged in our issue management system for further investigation and resolution.

PDFNET-44331: 9781139882019 - fails on page 3 - with a Null Reference Exception
PDFNET-44332: 9781292249117 - crushingly slow and OOME
PDFNET-44333: Performance and memory consumption

However, you can avoid the problem with the 9781447969662.pdf file by using the code snippet below in your environment.

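        // PdfFileInfo is in the Aspose.Pdf.Facades namespace.
        PdfFileInfo info = new PdfFileInfo(path + "9781447969662.pdf");

        // Only process the file when it is recognized as a valid PDF.
        if (info.IsPdfFile)
        {
            // Your code here
        }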
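
As a rough sketch only (not an official fix for the logged tickets), the validity check could be combined with a page-by-page extraction loop that times each page and catches failures, which may help show which pages of a given file are slow or throw. The class name `ExtractionProbe` and the default file name `sample.pdf` are hypothetical; the Aspose calls are the ones already mentioned in this thread (`PdfFileInfo.IsPdfFile`, `Page.Accept`, `TextFragmentAbsorber`).

```csharp
using System;
using System.Diagnostics;
using Aspose.Pdf;
using Aspose.Pdf.Facades;
using Aspose.Pdf.Text;

class ExtractionProbe
{
    static void Main(string[] args)
    {
        // Hypothetical default; pass the real PDF path as the first argument.
        string file = args.Length > 0 ? args[0] : "sample.pdf";

        // Skip files that Aspose does not recognize as valid PDFs.
        PdfFileInfo info = new PdfFileInfo(file);
        if (!info.IsPdfFile)
        {
            Console.WriteLine("{0}: not recognized as a PDF, skipped", file);
            return;
        }

        using (Document doc = new Document(file))
        {
            // Aspose page collections are 1-based.
            for (int i = 1; i <= doc.Pages.Count; i++)
            {
                Stopwatch timer = Stopwatch.StartNew();
                try
                {
                    // Use a new absorber for each page so that fragments
                    // from earlier pages are not kept alive by this loop.
                    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
                    doc.Pages[i].Accept(absorber);

                    Console.WriteLine("page {0}: {1} fragments in {2} ms",
                        i, absorber.TextFragments.Count, timer.ElapsedMilliseconds);
                }
                catch (Exception ex)
                {
                    // Log the failing page and continue with the next one.
                    Console.WriteLine("page {0}: failed after {1} ms with {2}",
                        i, timer.ElapsedMilliseconds, ex.GetType().Name);
                }
            }
        }
    }
}
```

The per-page timings and exception types from a run like this are the kind of detail that makes the performance and memory reports above easier to compare between files. Whether the process can continue after an out of memory exception depends on the runtime state, so treat the catch block as best-effort logging.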

Please keep the files in your Google Drive, with link sharing on, for our reference. The issue IDs have been linked with this thread so that you will receive notifications as soon as the issues are resolved.

We are sorry for the inconvenience.