Extract custom metadata from portfolio PDFs?

russ.nichols · March 8, 2019, 1:15pm

I have the same question as

but it is not clear how I can get the metadata (To, From, Subject, Keywords, etc.) that is
associated with portfolio PDF attachments. I can see how to extract attachments from the PDF, which
is well documented, but not how to extract the associated metadata.

For example, see this screenshot of a portfolio PDF with an attachment that has
From, Subject, To, Date and other metadata fields:

https://www.dropbox.com/s/30lxs895qiemcjp/PortfolioMetadata.png?dl=0

asad.ali · March 8, 2019, 6:33pm

@russ.nichols

Thanks for contacting support.

As we understand it, you want to extract the custom metadata properties of a PDF that is present as an attachment (not of the parent PDF that contains the attachment). Would you please confirm that our understanding is correct and share your sample PDF document with us? We will then test the scenario in our environment and address it accordingly.

russ.nichols · March 8, 2019, 8:46pm

Your understanding is correct.
This is a sample file:

I would like to extract From, Subject, Date and the other metadata fields associated with the
file attached to this portfolio PDF.
For example, I would expect the value 5/3/2011 1:13:43 PM to be returned for the Date field.

asad.ali · March 9, 2019, 2:53am

@russ.nichols

Thanks for sharing the sample PDF.

Would you please try the following code snippet, which prints the metadata values that are currently available?

// Requires: using System; using System.IO; using Aspose.Pdf;

// Open the portfolio document
Document pdfDocument = new Document(dataDir + "PorfolioWithCustomMetadata.pdf");

// Get a particular embedded file
FileSpecification fileSpecification = pdfDocument.EmbeddedFiles[1];

// Print the file properties stored in the portfolio
Console.WriteLine("Name: {0}", fileSpecification.Name);
Console.WriteLine("Description: {0}", fileSpecification.Description);
Console.WriteLine("Mime Type: {0}", fileSpecification.MIMEType);

// Copy the attachment into memory. CopyTo is used because a single
// Stream.Read call is not guaranteed to fill the whole buffer.
MemoryStream attachmentStream = new MemoryStream();
fileSpecification.Contents.CopyTo(attachmentStream);
attachmentStream.Position = 0;

// Open the attachment as a document of its own
Document attachedDocument = new Document(attachmentStream);

// Show the attachment's own document information
DocumentInfo docInfo = attachedDocument.Info;
Console.WriteLine("Author: {0}", docInfo.Author);
Console.WriteLine("Producer: {0}", docInfo.Producer);
Console.WriteLine("Creation Date: {0}", docInfo.CreationDate);
Console.WriteLine("Keywords: {0}", docInfo.Keywords);
Console.WriteLine("Modify Date: {0}", docInfo.ModDate);
Console.WriteLine("Subject: {0}", docInfo.Subject);
Console.WriteLine("Title: {0}", docInfo.Title);

The value exists in the fileSpecification.Description property. Furthermore, if we open the attached PDF separately and check its properties, its creation and modification dates differ from the date in the description. Would you please check the shared code snippet and share your feedback with us? We will then proceed accordingly.

russ.nichols · March 13, 2019, 12:50pm

Running your sample code, we get the following:

PART 1.
Name: RE_ Bend aeration basins(5).pdf
Description: “Griffiths, Jim/CVO” 5/3/2011 RE_ Bend aeration basins(5).pdf
Mime Type: application/pdf

PART 2.
Author: null
Producer: Adobe PDF Library 15.0
Creation Date: 4/1/17 4:25 PM
Keywords: null
Modify Date: 4/1/17 4:25 PM
Subject: null
Title: null

PART 2 is the metadata of the attached PDF itself, which is different from the
metadata stored in the portfolio PDF, so it does not help us in this particular case.

PART 1 is the metadata associated with the file in the portfolio PDF, but it does not include all the
information we are trying to extract. For reference, we are trying to migrate from another PDF
library, which exposes the portfolio metadata as a dictionary with a set of arbitrary
keys and values. For the file attached to this portfolio PDF, for example, it provides a dictionary
with the following 7 entries:

Folder Location: Archive Folders/old projects/Old Projects/Bend
Date: 5/3/2011
From: Griffiths, Jim/CVO
Guid: 000000002076ACD34AFEF2458D464E15938F9BE4849D2400
Subject: RE_ Bend aeration basins(5).pdf
To: Menniti, Adrienne/PDX, Griffiths, Jim/CVO, Burton, Kevin/BEL, Brown, Gene/CVO
Cc: Elkins, Lori/CVO, Rose, Sterling/CVO, Leaf, William/BOI

So the FileSpecification.Description field that Aspose PDF returns collapses 3 of these fields
(From, Date and Subject) into a single string. We have to process arbitrary PDFs, which might
have different metadata fields, so this API makes it hard to retrieve even 3 of the 7 fields we need.
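
To illustrate how fragile that is, here is a minimal sketch of splitting the collapsed string back into its three parts. It assumes the layout seen in the PART 1 output above (a quoted sender, a M/D/YYYY date, then the subject); DescriptionParser and TryParse are hypothetical names, not part of Aspose.PDF, and any portfolio whose description is laid out differently will simply not match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF. It only understands the one
// layout observed in this thread: "<From>" <M/D/YYYY> <Subject>
public static class DescriptionParser
{
    // Accepts straight or curly quotes around the sender, then a date,
    // then treats the remainder of the string as the subject.
    private static readonly Regex Pattern = new Regex(
        "^[\"\u201C](?<from>[^\"\u201D]*)[\"\u201D]\\s+" +
        "(?<date>\\d{1,2}/\\d{1,2}/\\d{4})\\s+" +
        "(?<subject>.+)$",
        RegexOptions.CultureInvariant);

    // Returns null when the description does not follow the assumed layout.
    public static Dictionary<string, string> TryParse(string description)
    {
        if (string.IsNullOrWhiteSpace(description))
        {
            return null;
        }

        Match match = Pattern.Match(description.Trim());
        if (!match.Success)
        {
            return null;
        }

        return new Dictionary<string, string>
        {
            ["From"] = match.Groups["from"].Value,
            ["Date"] = match.Groups["date"].Value,
            ["Subject"] = match.Groups["subject"].Value
        };
    }
}
```

Passing fileSpecification.Description to DescriptionParser.TryParse would recover at most From, Date and Subject for this file. The time of day, To, Cc, Guid and Folder Location are not present in the string at all, so no amount of parsing can bring them back.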

asad.ali · March 13, 2019, 8:58pm

@russ.nichols

Thanks for sharing your detailed feedback.

We have logged an enhancement request as PDFNET-46141 in our issue tracking system for your requirements. We will investigate the ticket with a view to implementing the required functionality in the API. As soon as there is a significant update regarding the ticket's resolution, we will let you know. Please be patient and allow us a little time.

We are sorry for the inconvenience.