Unable to extract embedded documents from some PDFs

Hi,


I'm unable to extract embedded documents from many PDF files, and the failure rate in my test set is noticeably high. I'm attaching a document (PDFWithFileAttachmentAnnotation) from which I couldn't extract the embedded documents. I'm using "EmbeddedFileCollection files = pdfDoc.EmbeddedFiles".

Another issue: I'm getting an invalid cast exception when opening certain documents, such as the attached file (CataLyst_3Attch), which you can use for testing.

Please let me know as soon as possible.

Thanks,
John.

Hi John,

Thank you for sharing the template files.

jgabriel-ulx:

I'm unable to extract embedded documents from many PDF files, and the failure rate in my test set is noticeably high. I'm attaching a document (PDFWithFileAttachmentAnnotation) from which I couldn't extract the embedded documents. I'm using "EmbeddedFileCollection files = pdfDoc.EmbeddedFiles".

We were able to reproduce this issue in our initial testing. It has been logged in our issue tracking system as PDFNEWNET-30920. We will post any updates on this issue in this forum thread.

jgabriel-ulx:

Another issue: I'm getting an invalid cast exception when opening certain documents, such as the attached file (CataLyst_3Attch), which you can use for testing.

We were able to reproduce this issue in our initial testing with the template file you shared. It has been logged in our issue tracking system as PDFNEWNET-30922. We will post any updates on this issue in this forum thread.

Sorry for the inconvenience caused,

Hi John,

Thanks for your patience. I am pleased to share that the issues reported earlier have been fixed, and the hotfix will be included in the upcoming release, v6.3.0. Please wait for that version.

Please note that this PDF document doesn't have files which are embedded directly. There are 2 types of attachments (embedded files): 1) those referenced through the /EmbeddedFiles entry of the document catalog's /Names dictionary, 2) those which come through page file attachment annotations. The following snippet gets all embedded files of type 1 and writes each one to disk:

// Open the document and get the catalog-level embedded files (type 1)
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"E:\PDFWithFileAttachmentAnnotation.pdf");
EmbeddedFileCollection embeddedFiles = pdfDocument.EmbeddedFiles;
foreach (FileSpecification fileSpecification in embeddedFiles)
{
    // Get the attachment and write it to a file (any writable stream works the same way).
    // Assumes d:\pdftest is an existing output directory.
    string outputPath = Path.Combine(@"d:\pdftest", Path.GetFileName(fileSpecification.Name));
    using (FileStream fileStream = new FileStream(outputPath, FileMode.Create))
    {
        // CopyTo keeps reading until the end of the stream, unlike a single Read call
        fileSpecification.Contents.CopyTo(fileStream);
    }
}

However, PDFWithFileAttachmentAnnotation.pdf doesn't have such attachments, so the embeddedFiles collection is empty, which is expected. Nevertheless, PDFWithFileAttachmentAnnotation.pdf has 2 attachments which come from page annotations (type 2). To extract them we can use either the DOM or the PdfExtractor facade. Let's use PdfExtractor first. The next code snippet extracts the 2 attachments into the current directory:


PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(@"PDFWithFileAttachmentAnnotation.pdf");
extractor.ExtractAttachment();
extractor.GetAttachment("");

The next sample does exactly the same, but accesses each embedded file separately:

PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(@"PDFWithFileAttachmentAnnotation.pdf");
foreach (string name in extractor.GetAttachNames())
{
extractor.ExtractAttachment(name);
extractor.GetAttachment("");
}

PdfExtractor also allows extracting embedded files into memory (see comments). Note that PdfExtractor extracts all types of embedded files. Now let's use the DOM. I know that this PDF document has only one page, so the next code snippet relies on that fact:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"PDFWithFileAttachmentAnnotation.pdf");
// Page indexes start at 1; this document has a single page
foreach (Annotation annotation in pdfDocument.Pages[1].Annotations)
{
    if (annotation is FileAttachmentAnnotation)
    {
        FileAttachmentAnnotation attachment = (annotation as FileAttachmentAnnotation);
        // Copy the attachment contents into a file in the current directory
        using (BinaryReader reader = new BinaryReader(attachment.File.Contents))
        using (BinaryWriter writer = new BinaryWriter(new FileStream(Path.GetFileName(attachment.File.Name), FileMode.Create)))
        {
            writer.Write(reader.ReadBytes((int)attachment.File.Contents.Length));
        }
    }
}
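
If a document may contain more than one page, the single-page DOM approach can be generalized by looping over every page. The following is only a minimal sketch under stated assumptions, not a definitive implementation: the class AttachmentSketch, the method SaveAnnotationAttachments and its outputDirectory parameter are hypothetical names introduced for illustration, and the namespace of the annotation types may differ between releases.

```csharp
using System.IO;
using Aspose.Pdf;
// Assumption: annotation types live in this namespace in recent releases;
// older releases used Aspose.Pdf.InteractiveFeatures.Annotations instead.
using Aspose.Pdf.Annotations;

static class AttachmentSketch // hypothetical helper class
{
    // Hypothetical helper: saves every file attachment annotation (type 2)
    // found on any page of the document into outputDirectory.
    public static void SaveAnnotationAttachments(string pdfPath, string outputDirectory)
    {
        Document pdfDocument = new Document(pdfPath);
        foreach (Page page in pdfDocument.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                FileAttachmentAnnotation attachment = annotation as FileAttachmentAnnotation;
                if (attachment == null)
                    continue; // not a file attachment annotation, skip it

                // Keep only the file name so the attachment cannot choose its own directory
                string target = Path.Combine(outputDirectory, Path.GetFileName(attachment.File.Name));
                using (FileStream output = new FileStream(target, FileMode.Create))
                {
                    attachment.File.Contents.CopyTo(output);
                }
            }
        }
    }
}
```

For example, calling AttachmentSketch.SaveAnnotationAttachments(@"PDFWithFileAttachmentAnnotation.pdf", ".") would write the annotation attachments into the current directory.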

The issues you reported earlier (filed as PDFNEWNET-30920 and PDFNEWNET-30922) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.