Aspose.Pdf.Kit.PdfExtractor seems to freeze on a PDF document

Hello,


I am using Aspose.Pdf.Kit to extract text from PDF files. The Aspose.Pdf.Kit version is 2009.07.03. I am using the following code:

Aspose.Pdf.Kit.License l = new Aspose.Pdf.Kit.License();
l.SetLicense(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location) + @"\Aspose.Total.lic");
Aspose.Pdf.Kit.PdfExtractor extractor = new Aspose.Pdf.Kit.PdfExtractor();
extractor.BindPdf("test.pdf");
extractor.ExtractText();
string tmpFilename = Path.GetTempFileName();
File.Delete(tmpFilename);
extractor.GetText(tmpFilename);

The call to extractor.ExtractText() for the attached document takes a very long time (I left it running for 15 minutes) and neither returns nor throws an exception. It is similar to my previous post (https://forum.aspose.com/t/120969), which was resolved. Do you know why text extraction is failing for this document?



Hi James,

I have reproduced the issue at my end and logged it as PDFKITNET-9886 in our issue tracking system. Our team will look into the matter, and you’ll be updated via this forum once the issue is resolved.

I would also like to add that, because PDF files vary widely in their content and in the way they are generated, content manipulation can sometimes cause issues. That is why the ExtractText method fails with some files. Nevertheless, our team will try to resolve the issue as soon as possible.

We’re sorry for the inconvenience.
Regards,

Thanks for your help. For the moment, it would be sufficient if the ExtractText method threw an exception or if a timeout could be set. Would either of these be possible?
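
In the meantime, one caller-side workaround for a call that never returns is to run it on a background thread and stop waiting after a fixed time. The following is only a rough sketch under that assumption: TimeoutRunner and RunWithTimeout are hypothetical names, not part of Aspose.Pdf.Kit, and the sketch does not cancel the hung call. The worker thread keeps running (and using CPU) until it finishes or the process exits.

```csharp
using System;
using System.Threading;

// Hypothetical helper; nothing here comes from the Aspose.Pdf.Kit API.
static class TimeoutRunner
{
    // Runs 'work' on a background thread and waits at most 'timeout' for it.
    // Throws TimeoutException if the work has not finished in time.
    public static void RunWithTimeout(Action work, TimeSpan timeout)
    {
        Exception failure = null;

        Thread worker = new Thread(() =>
        {
            try { work(); }
            catch (Exception ex) { failure = ex; } // hand the error to the caller
        });

        // A background thread does not keep the process alive if it hangs.
        worker.IsBackground = true;
        worker.Start();

        // Join returns false when the timeout elapses before the thread ends.
        if (!worker.Join(timeout))
            throw new TimeoutException("Operation did not finish within " + timeout + ".");

        if (failure != null)
            throw new Exception("Operation failed on the worker thread.", failure);
    }
}
```

With the extractor from the code above, the call would look like TimeoutRunner.RunWithTimeout(() => extractor.ExtractText(), TimeSpan.FromMinutes(1)); and the caller can then catch TimeoutException and skip the document.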

Hi James,

I have passed your request on to our development team, and they will decide whether it is feasible. You’ll be updated accordingly.

Regards,


I'm having the same issue. Has there been a fix for this?

Or is there at least a workaround for getting the text out of the PDF?

Hi Remy,

Please share the problematic PDF with us so that we can test the issue with your particular scenario. We’ll update you with the results accordingly.

We’re sorry for the inconvenience.
Regards,

It seems to happen with most larger PDFs. This one is 3 MB. I don't think this was an issue with the old PDF component, but since we updated to the newest one, we have had the issue. The old one was probably from 2008.

And here is the code:

// Requires: using System.IO; using Aspose.Pdf.Kit;
private static int CountInPDF(MemoryStream stream)
{
    // Instantiate the PdfExtractor object
    PdfExtractor extractor = new PdfExtractor();

    // Bind the input PDF document to the extractor
    extractor.BindPdf(stream);

    // Extract text from the PDF document
    extractor.ExtractText();

    // Write the extracted text into a memory stream
    //extractor.GetText(@"C:\tmp\text.txt");
    MemoryStream mem = new MemoryStream();
    extractor.GetText(mem);

    // Rewind the stream and read the text back as a string
    mem.Seek(0, SeekOrigin.Begin);
    StreamReader reader = new StreamReader(mem);
    string text = reader.ReadToEnd();

    // Count the words in the extracted text
    return CountWordsInString(text);
}

The slow part is extractor.ExtractText();

Hi Remy,

I have tested this issue at my end using the file you shared and the latest version (Aspose.Pdf.Kit for .NET 4.0.0), and extracted the text successfully; it didn’t take much time either. Please download the latest version and try at your end.

If you still find any issues or have some more questions, please do let us know.
Regards,

It does work, but it gets stuck at ExtractText for several seconds and the CPU load is at 100%. That was not the case before. I’m using version 4.0.0. My previous version was 3.2.0.0.

Hi Remy,

Can you please share the time taken by the ExtractText method, for the attached file, at your end? It took about 37 seconds at my end. Moreover, can you please share the details regarding your machine and OS as well?
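
For comparable numbers, the call can be timed with System.Diagnostics.Stopwatch. This is a small sketch only; Timing and Measure are hypothetical helper names, not part of Aspose.Pdf.Kit.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for timing a single call.
static class Timing
{
    // Runs 'work' once and returns the wall-clock time it took.
    public static TimeSpan Measure(Action work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

For example, TimeSpan elapsed = Timing.Measure(() => extractor.ExtractText()); followed by Console.WriteLine(elapsed.TotalSeconds); reports the duration in seconds.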

We’re sorry for the inconvenience.
Regards,

Oh, no, it only took about 10 seconds on my end. I have a Core 2 Duo at 2.33 GHz with 2 GB of RAM. The server is a quad-core Xeon, and it takes some time there too.

37 seconds is a little much just to extract the text from a normal PDF. I mean there is no OCR or anything like that necessary and I don't think it was that slow with the old version.

Hi Remy,

I have logged this issue as PDFKITNET-13888 in our issue tracking system. Our team will look into it and we’ll try to improve the performance in our upcoming versions.

We’re sorry for the inconvenience.
Regards,

The issues you have found earlier (filed as 9886) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.

The issues you have found earlier (filed as 13888) have been fixed in this update.

