Issues extracting attachments from PDF Portfolios

Hello,

I am using Aspose.Pdf v. 6.8 and the old Aspose.Pdf.Kit (5.5) to extract file attachments from PDFs (both embedded files and page file annotations).

I am having problems with certain PDFs, mostly very large PDF Portfolios (over 100 MB).

The old Pdf.Kit version runs into an out-of-memory error when calling PdfExtractor.ExtractAttachment(). If I ignore the error and continue, it does at least retrieve some of the attachments (those read before the out-of-memory error), which can be saved using PdfExtractor.GetAttachment().

In the new version, I have tried using both the Aspose.Pdf.Facades.PdfExtractor and Aspose.Pdf.Document classes to extract the attachments. There are two major issues here:

1 - There are numerous PDF files that work with Pdf.Kit but not with the new version. I can produce samples, but they are large files, so I will need an FTP site to upload them to. All of these files are PDF Portfolios (the PDFFileInfo class correctly identifies them as such through the "HasCollection" property), yet no attachments are found. The EmbeddedFiles collection of the Aspose.Pdf.Document class is empty, and PdfExtractor returns no files either.

2 - When a PDF contains a single very large attachment (the sample file I can send has a 600 MB file embedded in it), there appears to be no way to extract the file without reading the whole thing into memory. Instead of throwing an out-of-memory exception, the extraction keeps going at an exceptionally slow speed, and if I let it finish (which took well over an hour, though I do not know exactly how long), the output file is incomplete. I can see this in Task Manager as soon as I access "Contents" of the PdfFileSpecification class: initially the Memory Usage and I/O Read Bytes increase very quickly to about 500 MB and 280 MB respectively, then a threshold is hit. Memory usage drops to about 300 MB and stays there, and the I/O reads slow to a crawl, about 10-15 seconds per MB.

I can provide an FTP link to sample files or upload them to a site of your choosing.

Any help would be appreciated.

Thanks

Doug

Hello Doug,


Thanks for using our products, and sorry for the late reply.

I am trying to replicate this issue using the sample PDF documents that I have, and I will update you with my findings shortly. We are sorry for the inconvenience.

Hello Doug,


Thanks for your patience.

I have tested the attachment extraction scenario further and I am able to reproduce the problem: the files are not extracted by either the EmbeddedFileCollection or the PdfExtractor class. I have logged this problem as PDFNEWNET-33573 in our issue tracking system so that it can be corrected. We will look further into the details and keep you updated on the status of the fix. Please be patient and allow us a little time. We are really sorry for the inconvenience.

Thank you.

Doug

Hi Doug,


Thanks for your patience.

We have investigated this problem further and concluded that, to avoid the null reference exception in your example, you should check that the MIME type is not null before using it:

    if (fileSpecification.MIMEType != null)
        MessageBox.Show("Mime Type: " + fileSpecification.MIMEType);
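
For illustration only, here is a minimal sketch of the same null check combined with a chunked stream copy, which avoids holding a whole attachment in memory on the caller's side. It uses only System.IO so that it runs on its own. The names AttachmentCopy, CopyInChunks and DescribeMimeType are hypothetical helpers, not part of Aspose.Pdf, and the MemoryStream stands in for the stream you would get from the file specification's Contents. Whether the library itself decodes that stream lazily is not something this thread confirms, so this may not help with the slow extraction described earlier.

```csharp
using System;
using System.IO;

// Hypothetical helper class; none of these names come from Aspose.Pdf.
public static class AttachmentCopy
{
    // Copies source to destination in fixed-size chunks and returns the
    // number of bytes copied, so only one buffer is held in memory at a time.
    public static long CopyInChunks(Stream source, Stream destination, int bufferSize = 81920)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (destination == null) throw new ArgumentNullException("destination");
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");

        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Same idea as the null check above: never dereference a missing MIME type.
    public static string DescribeMimeType(string mimeType)
    {
        return "Mime Type: " + (mimeType ?? "(not specified)");
    }

    public static void Main()
    {
        // Assumption: this in-memory payload stands in for an attachment stream.
        byte[] payload = new byte[200000];
        new Random(1).NextBytes(payload);

        using (var source = new MemoryStream(payload))
        using (var destination = new MemoryStream())
        {
            long copied = CopyInChunks(source, destination, 4096);
            Console.WriteLine(DescribeMimeType(null));
            Console.WriteLine("Copied " + copied + " bytes");
        }
    }
}
```

In real use the destination would be a FileStream opened for writing rather than a MemoryStream, so the attachment goes straight to disk.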


If you still encounter the same problem or have any further queries, please feel free to contact us. We are sorry for the inconvenience.

The issues you have found earlier (filed as PDFNEWNET-33573) have been fixed in this update.

