PdfExtractor.ExtractText is too slow

rhrufftx · August 20, 2009, 12:25am

Version 3.5 (.NET) is 80 times slower than previous versions when calling PdfExtractor.ExtractText().

In a simple batch text extraction of 250 PDF documents, version 3.2 averaged 31 ms per document whereas version 3.5 averaged an astounding 2493 ms per document.
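For anyone who wants to reproduce the per-document average, a minimal timing harness might look like the sketch below. It is not the code used for the numbers above. `ExtractionBenchmark`, `AverageMsPerDocument`, and the `extractText` delegate are hypothetical names, and the delegate is assumed to wrap whatever call performs the extraction (for example, a method that ends in PdfExtractor.ExtractText()).

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration; not part of any Aspose API.
public static class ExtractionBenchmark
{
    // Returns the average wall-clock time, in milliseconds, spent per document.
    // 'extractText' is a placeholder: it receives a file path and is assumed
    // to run the text extraction for that file and return the extracted text.
    public static double AverageMsPerDocument(string[] pdfPaths, Func<string, string> extractText)
    {
        if (pdfPaths == null || pdfPaths.Length == 0)
        {
            return 0.0; // nothing to measure
        }

        var stopwatch = new Stopwatch();
        foreach (string path in pdfPaths)
        {
            // Start() resumes the stopwatch without resetting it, so the
            // elapsed time accumulates across all documents and excludes
            // the loop overhead between calls.
            stopwatch.Start();
            extractText(path);
            stopwatch.Stop();
        }

        return stopwatch.Elapsed.TotalMilliseconds / pdfPaths.Length;
    }
}
```

Running the same file list through the same harness against each library version keeps the comparison fair, since disk access and loop overhead are then identical on both sides.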

I have attached three small documents that demonstrate the issue:

3 pages of text that takes 10 seconds
1 page of text that takes 20 seconds
1 page of text that takes 380 ms (this document converted the fastest of the entire set, but it was still more than 10 times slower than the 31 ms average of version 3.2).

I am eager to use the new version because it eliminates the unmanaged code and handles other Western languages by improving support of the characters in the upper ANSI range. But the performance has got to be improved. I hope the attached files will help you find/fix the issue.

shahzadlatif · August 20, 2009, 6:32am

Hi Robert,

I tested the issue on my end and found that one of the PDFs took the same time in 3.4 and 3.5. The other two files did not work with 3.4 at all: an exception was thrown during text extraction. Those failures are resolved in 3.5. Our team will look into whether the performance of the PdfExtractor class can be improved, and we will update you via this forum. The issue is logged as PDFKITNET-10150 in our issue tracking system.

We’re sorry for the inconvenience.
Regards,

aspose.notifier · August 6, 2010, 2:14pm

The issues you found earlier (filed as PDFKITNET-10150) have been fixed in this update.