Extracting PDF Attachments (Memory Issue)

Xadeqd · October 5, 2010, 10:33am

I am running into a memory issue with Aspose.Pdf.Kit's PdfExtractor when processing very large (500 MB+) Portfolio PDFs.

Please advise. Here is my code, with comments.

PdfExtractor extractor = new Aspose.Pdf.Kit.PdfExtractor();
extractor.BindPdf(DocumentPath);
extractor.ExtractAttachment(); // This is where it fails.
ArrayList attachNames = extractor.GetAttachNames();
int counter = 0;
foreach (var name in attachNames)
{
    extractor.ExtractAttachment(name.ToString());
    ArrayList attachInfo = extractor.GetAttachmentInfo();
    foreach (AsposePDFKit.AttachmentInfo info in attachInfo)
    {
        ..... //extract the individual attachments
    }
}

The OutOfMemory exception is thrown on line 3 of the snippet, "extractor.ExtractAttachment();".

I call ExtractAttachment() without a file name only because I have not found another way to get the attachment names: calling extractor.GetAttachNames() before ExtractAttachment() throws an "object reference not set" exception. If I could get the list of names first, I could extract the attachments individually. Ideally there would be a way to enumerate the attachments one at a time, regardless of file name, so that memory is not spent preparing streams for all of them at once.

In short, I believe the problem is that ExtractAttachment() must be called before GetAttachNames(), and ExtractAttachment() can be very memory intensive because it prepares streams for all of the attachments. Please let me know if there is a memory-safe way to use PdfExtractor.

Thank you for any help.

shahzadlatif · October 6, 2010, 8:33am

Hi Duncan,

We’re investigating your issue at our end and you’ll be updated shortly.

We’re sorry for the inconvenience.
Regards,

shahzadlatif · October 7, 2010, 1:13am

Hi Duncan,

We’re unable to reproduce this behavior at our end. Could you please download the latest version (4.9.0) and test with that? If you still see the same issue, please upload a sample input file to an FTP server and share the URL with us, so that we can reproduce the issue with your particular scenario. Please also note that the ExtractAttachment method performs the initialization tasks, so it can’t be bypassed; nevertheless, we’ll try either to improve the memory utilization or to provide some way for you to get the attachments easily.

We’re sorry for the inconvenience and look forward to helping you.
Regards,

Xadeqd · October 7, 2010, 8:17am

4.9.0 still has the issue. Here is a smaller, 132 MB PDF. It is small enough not to cause an out-of-memory exception, but large enough that the memory problem is easy to see as you step through the code.

http://www.duncancooper.net/HelloWorld.pdf

I made it using the Aspose.PDF component (which also has memory issues: I couldn’t include a 250 MB text file as an attachment without a memory exception, but this 132 MB text file was okay).

When you call ExtractAttachment() on this 132 MB file, you will see the memory usage in Task Manager go up by approximately 265 MB.
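To put a number on that growth from inside the program rather than from Task Manager, a minimal sketch along these lines could be used. `MemoryProbe` and `MeasureManagedDelta` are hypothetical names introduced here for illustration; only `GC.GetTotalMemory` is a real .NET call. It reports managed heap growth only, so the figure will not match Task Manager exactly (`Process.GetCurrentProcess().PrivateMemorySize64` is closer to what Task Manager shows).

```csharp
using System;

public static class MemoryProbe
{
    // Returns the change in managed heap size, in bytes, across the call to action.
    public static long MeasureManagedDelta(Action action)
    {
        // Force a collection first so the baseline excludes garbage from earlier work.
        long before = GC.GetTotalMemory(true);

        action();

        // Do not collect here: count everything the call allocated,
        // whether or not it is still referenced.
        long after = GC.GetTotalMemory(false);
        return after - before;
    }
}
```

Assuming the `extractor` from the first post, the call under test would be wrapped as `MemoryProbe.MeasureManagedDelta(() => extractor.ExtractAttachment())`.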

shahzadlatif · October 7, 2010, 2:07pm

Hi Duncan,

I have tested this issue at my end using the shared sample file and the code snippet. Memory usage goes up to 272 MB at my end, and the ExtractAttachment method does not return control. Can you please confirm whether it also fails to return control at your end, or whether you were able to extract the attachment from this file? Although the high memory usage is clearly there, I want to confirm the duration of the process so that the whole scenario is clear to our team.

We’re sorry for the inconvenience and look forward to helping you.
Regards,

Xadeqd · October 7, 2010, 2:29pm

Thanks again for looking into this.

Yes, I can successfully extract from that PDF file, and ExtractAttachment() returns control after about 2 seconds. Here is the code that extracts each attachment (the part I left out of the inner loop in my first post).

counter++;
string destPath = @"C:\data\Attachment\" + counter + GetExtension(info.FileName);
// The using block flushes and closes the file even if WriteTo throws.
using (FileStream outStream = File.OpenWrite(destPath))
{
    info.AttachmentStream.WriteTo(outStream);
}
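The snippet above relies on WriteTo, which in .NET is a MemoryStream method, so the attachment is presumably already fully buffered in memory by the time it is written out. If a stream that is not a MemoryStream were available, a chunked copy like the sketch below would hold only one small buffer at a time. `StreamCopy` and `CopyToFile` are hypothetical names for illustration, not part of Aspose.Pdf.Kit; the code uses only standard System.IO calls.

```csharp
using System.IO;

public static class StreamCopy
{
    // Copies source to destPath in 64 KB chunks, starting at the source's
    // current Position, and returns the number of bytes written.
    public static long CopyToFile(Stream source, string destPath)
    {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;

        // File.Create truncates an existing file; File.OpenWrite does not,
        // so a shorter write over an old file would leave stale trailing bytes.
        using (FileStream output = File.Create(destPath))
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```

Unlike WriteTo, this reads from the current Position, so a stream that has already been read to the end needs `Position = 0` first.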

shahzadlatif · October 8, 2010, 1:55am

Hi Duncan,

I have logged this issue as PDFKITNET-20631 in our issue tracking system for further investigation. Our development team will look into it in detail and provide a fix. You’ll be notified via this forum thread once it is resolved.

We’re sorry for the inconvenience.
Regards,

aspose.notifier · November 10, 2010, 12:28pm

The issues you have found earlier (filed as 20631) have been fixed in this update.