Free Support Forum - aspose.com

PDF text extraction is taking a long time and CPU is pegged at 100%

I am using Aspose.Pdf version 6.6.0.0 on a Windows Server 2008 R2 machine to extract text from PDF files. There is no Adobe Reader or any other type of PDF reader installed on the 2008 R2 machine.

The relevant code looks like this:

using (Document pdfDocument = new Document(pathToPdf))
{
    // Create a TextAbsorber object to extract text
    TextAbsorber textAbsorber = new TextAbsorber();
    // Accept the absorber for all the pages
    pdfDocument.Pages.Accept(textAbsorber);
    // Get the extracted text
    contents = textAbsorber.Text;
    textAbsorber = null;
}

For the first 100 or so documents, text extraction is fast. I also have a thread watching each extraction: if it takes longer than 15 seconds per megabyte of file size, I stop the extraction and move on, because it is most likely stuck.

So the maximum time I am willing to wait for extraction for a 2 megabyte file is 30 seconds.
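
To make the timeout rule above concrete, here is a minimal sketch of it. The class and method names (ExtractionTimeout, BudgetFor, TryExtract) are made up for illustration and are not part of Aspose.Pdf or of my actual code; the extraction itself is passed in as a delegate so the sketch does not depend on any particular library, and it assumes .NET 4.5 or later for Task.Run.

```csharp
using System;
using System.Threading.Tasks;

static class ExtractionTimeout
{
    // 15 seconds per megabyte, as described in the post.
    const double SecondsPerMegabyte = 15.0;

    // Hypothetical helper: time budget for a file of the given size in bytes.
    public static TimeSpan BudgetFor(long fileSizeBytes)
    {
        double megabytes = fileSizeBytes / (1024.0 * 1024.0);
        return TimeSpan.FromSeconds(megabytes * SecondsPerMegabyte);
    }

    // Hypothetical helper: runs extract() on a worker and waits up to the budget.
    // Returns true and the text if it finished in time, false otherwise.
    public static bool TryExtract(Func<string> extract, TimeSpan budget, out string text)
    {
        Task<string> task = Task.Run(extract);
        if (task.Wait(budget))
        {
            text = task.Result;
            return true;
        }
        text = null; // timed out; the caller moves on to the next file
        return false;
    }
}
```

With this, a 2 megabyte file (2 * 1024 * 1024 bytes) gets a budget of 30 seconds. Note that Wait returning false only stops the waiting: the abandoned work keeps running and keeps using CPU until it finishes on its own.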

This should be sufficient, but when extracting thousands of files, the process gets slower and slower until every single file is timing out.

Is there something wrong with the way I am extracting the text (in the code above) that is somehow leaking resources?
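
One way to check whether something is accumulating from file to file would be to log the elapsed time and the change in managed heap size around each extraction. The following is only a sketch under that assumption: ExtractionDiagnostics and Measure are hypothetical names, and the extraction is again passed in as a delegate rather than calling Aspose.Pdf directly.

```csharp
using System;
using System.Diagnostics;

static class ExtractionDiagnostics
{
    // Hypothetical helper: runs extract() and reports how long it took and how
    // much the managed heap grew, measured after forced collections.
    public static string Measure(Func<string> extract, out TimeSpan elapsed, out long heapDeltaBytes)
    {
        long before = GC.GetTotalMemory(true); // true = collect before measuring
        Stopwatch timer = Stopwatch.StartNew();
        string text = extract();
        timer.Stop();
        elapsed = timer.Elapsed;
        heapDeltaBytes = GC.GetTotalMemory(true) - before;
        return text;
    }
}
```

The heap delta includes the returned text itself, so some growth per file is expected; a delta that keeps climbing across thousands of files would be the suspicious pattern. This only covers managed memory, so unmanaged allocations would not show up here.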

Any assistance or guidance is appreciated.

Thanks,

Greg

Hi Greg,

Thank you for sharing the details.

Please download and try the latest version of Aspose.Pdf for .NET v6.8. We have made some improvements to the text extraction process and also addressed memory leak issues in this release. However, if you still face the issue, please share your template PDF file with us. This will help us identify the cause of the issue and fix it soon.

Sorry for the inconvenience,

I am downloading the update now. I will post back here if I have any other problems.

Thank you for your help!