ExtractText() problems

Hi



I’m trying to extract text from a PDF via ExtractText():



PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(fileName);
extractor.ExtractText();
using (MemoryStream ms = new MemoryStream())
{
    extractor.GetText(ms);
    ms.Seek(0, SeekOrigin.Begin);
    using (StreamReader sr = new StreamReader(ms))
    {
        string result = sr.ReadToEnd();
        return result;
    }
}



but it’s very slow, and after calling extractor.GetText(ms) an ObjectDisposedException is thrown.



It reproduces on every PDF file I try to extract text from.

So, is this expected behaviour, or is there a workaround?



Thanks

Hi acex,

Thanks for considering Aspose.

I have reproduced the error you mentioned above.

We will look into this issue and respond to you shortly.

Regards,

Please download the new DLL from here and test it.

Any other questions are welcome.

Best regards.

Hi, thanks for the reply.



I downloaded the new DLL. It seems to work, but it is extremely slow: I have a 6-page PDF, and extracting the first two pages takes several minutes. At least the exception is gone, but waiting 5-7 minutes to extract a small amount of text from a PDF is not acceptable for me.



Is there some workaround to avoid this?



Thanks

Dear acex,

I tested the long execution time problem with the sample PDF document attached here, and the process finished in about 21 seconds.

Could you please test the attached PDF and see how long it takes? If you get a similar result, the long run time may be caused by the content of your document. We would also like to test your document, if you could attach it here for our reference.

[codes]

string path = @"E:\QA\aspose.pdf.kit\examples\resources\text.pdf";
//string text;
DateTime d1 = DateTime.Now;
Console.WriteLine(d1);
PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(path);
extractor.ExtractText();
//extractor.GetText(@"E:\NeedToTest\Kit\6.21\Mytext.txt");
using (MemoryStream ms = new MemoryStream())
{
    extractor.GetText(ms);
    ms.Seek(0, SeekOrigin.Begin);
    //text = ms.GetBuffer().ToString();
    //ms.Close();
    StreamReader sr = new StreamReader(ms);
    string result = sr.ReadToEnd();
    Console.WriteLine(result);
    DateTime d2 = DateTime.Now;
    Console.WriteLine(d2);
}

[Debug Info]

2006-9-4 16:30:09

2006-9-4 16:30:30

[Debug Situations]

Microsoft Visual C# .NET 69514-335-0000007-18942

CPU: 2.8GHz

RAM: 1G
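
The timing above comes from two DateTime.Now readings around the whole run, so it cannot show which call accounts for the time. Timing each stage separately may help narrow that down. The following is only a minimal sketch, not Aspose code: StageTimer and Measure are hypothetical names introduced here for illustration, and the only library type used is System.Diagnostics.Stopwatch from the .NET base class library.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of Aspose.Pdf.Kit): times a single stage.
public static class StageTimer
{
    // Runs the action once, prints the elapsed time with the given label,
    // and returns the elapsed time so the caller can compare stages.
    public static TimeSpan Measure(string label, Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
        return sw.Elapsed;
    }
}
```

Assuming the extractor, path and ms variables from the snippet above, the calls could be wrapped as StageTimer.Measure("BindPdf", () => extractor.BindPdf(path)); and likewise for extractor.ExtractText() and extractor.GetText(ms). Running the same three measurements in the real application and in a clean test application would show which stage differs between the two.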

Dear Kevin Zuo,



I have found that the cause of the very long extraction time lies deep in my application.

On the supplied test PDF, extraction takes 4 minutes in my application, but in a clean test application it takes only 29 seconds (2GHz, RAM: 1G) on the first run. Later runs, on different files, are faster.



Thanks for the help.