Extracting Core, Custom and Extended Properties from Office Documents

Hi Team!

I want to extract metadata from Office documents (docx, pptx, xlsx) based on the Inspection and Sanitization Guidance (OFFICE 2007.4.8, OFFICE 2007.4.9, OFFICE 2007.4.10).

I created a .NET 6 project to extract these properties, but some of the fields are not extracted. For example:

  • Aspose.Words:
    • ScaleCrop (bool)
    • LinksUpToDate (bool)
    • SharedDoc (bool)
    • HyperlinksChanged (bool)
    • AppVersion (float). Maybe this is the "Version" field?
  • Aspose.Slides:
    • HeadingPairs (binary)
    • TitlesOfParts (string)
    • LinksUpToDate (bool)
    • HyperlinksChanged (bool)
    • ScaleCrop (bool)
    • Words (int)
    • Paragraphs (int)
    • Slides (int)
    • Notes (int)
    • HiddenSlides (int)
    • MMClips (int)
  • Aspose.Cells:
    • HeadingPairs (binary)
    • TitlesOfParts (string)

If I use ExifTool, all of these fields are extracted.

Could you help me extract all metadata (Core, Custom and Extended properties) with Aspose.Words, Aspose.Slides and Aspose.Cells?

Sample File:
SampleFiles.zip (52.0 KB)

.NET 6 Project:
OfficeMetadata.zip (2.1 KB)

Aspose.Cells 24.6.0
Aspose.Slides.NET 24.6.0
Aspose.Words 24.6.0

@erdeiga,

For Aspose.Cells, I tested your scenario/case using your template XLSX file and sample code snippet with Aspose.Cells v24.6. Here is the console output I got:


--- Cells Builtin Properties e:\test2\extracting office documents\Metadata.xlsx ---
Title                Title
Subject              Subject
Author               Windows User
Keywords             Keywords
Comments             Comments
LastSavedBy          Windows User
CreateTime           6/21/2024 11:55:52 AM
LastSavedTime        7/1/2024 9:54:26 AM
Category             Category
NameOfApplication    Microsoft Excel
Security             0
ScaleCrop            False
Manager              Manager
Company              Company
LinksUpToDate        False
SharedDoc            False
HyperlinkBase        https://www.google.com/
HyperlinksChanged    False
Version              16.0300

--- Cells Custom Properties e:\test2\extracting office documents\Metadata.xlsx ---
Text                 Text
Number               1234
Bool1                True
Bool2                False
Date                 1/1/2024 10:00:00 AM

I also opened your template XLSX file in MS Excel 2010 and 2019, but I could not find the properties you mentioned (core, custom, etc.) there. See the attached screenshots for reference.
sc_shot1.png (8.4 KB)
sc_shot2.png (9.4 KB)

How can I view those missing properties manually in MS Excel?

@erdeiga,
As for Aspose.Slides, I am working on the issue and will get back to you soon.

@erdeiga Aspose.Words stores the following metadata:

  • LinksUpToDate (bool) - available for .doc files, but not for .docx
  • AppVersion (float) - stored as Version

I have created an issue WORDSNET-27154 to provide access to other metadata.
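
Until such access is available, one possible workaround is to read the extended properties straight from the package: a .docx, .pptx or .xlsx file is a ZIP archive, and these values are conventionally stored in its docProps/app.xml part. The following is a minimal Python sketch that uses only the standard library, not the Aspose API. The function name read_extended_properties is hypothetical, and the sketch assumes a Transitional OOXML file with the part at its conventional path.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (assumption: Transitional OOXML).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def read_extended_properties(source):
    """Return the simple (text-only) extended properties as a dict of strings.

    `source` is a file path or a binary file-like object holding a
    docx/pptx/xlsx package. Hypothetical helper, not part of any Aspose API.
    """
    with zipfile.ZipFile(source) as package:
        # Assumption: the part lives at its conventional location.
        root = ET.fromstring(package.read("docProps/app.xml"))

    props = {}
    for child in root:
        # Elements with children (HeadingPairs, TitlesOfParts) hold vectors
        # and are skipped here; only plain text values are collected.
        if len(child) == 0 and child.tag.startswith("{" + EP_NS + "}"):
            name = child.tag.split("}", 1)[1]
            props[name] = (child.text or "").strip()
    return props
```

Calling read_extended_properties on one of the sample files should return entries such as ScaleCrop, LinksUpToDate, SharedDoc, HyperlinksChanged and AppVersion when the file contains them. All values come back as raw strings (for example "false"), so any conversion to bool or float is left to the caller.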

@erdeiga,
As for Aspose.Slides, we have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SLIDESNET-44626

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@erdeiga,

For Aspose.Cells, we have opened the following new ticket(s) in our internal issue tracking system to evaluate and investigate the extraction of HeadingPairs (binary) and TitlesOfParts (string) properties. We will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): CELLSNET-56107

Once we have an update on the ticket, we will inform you here.

@erdeiga

  • HeadingPairs (binary)
  • TitlesOfParts (string)

These two properties cache the number and names of the worksheets in the file; they duplicate information already held in the worksheet settings. To avoid maintaining two sets of data, Aspose.Cells does not read these two properties. You can obtain the same values as follows:

  • HeadingPairs (binary): WorksheetCollection.Count
  • TitlesOfParts (string): iterate over all sheets in the WorksheetCollection and read their names.
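
To see this duplication outside Aspose.Cells, the cached titles in docProps/app.xml can be compared with the sheet names declared in xl/workbook.xml. The following is a minimal Python sketch that uses only the standard library. The function names sheet_names and titles_of_parts are hypothetical, and the sketch assumes a Transitional OOXML workbook with both parts at their conventional paths.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by Transitional OOXML packages (assumption: not Strict).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
SML_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def sheet_names(source):
    """Sheet names as declared in xl/workbook.xml (the primary data)."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("xl/workbook.xml"))
    path = f"{{{SML_NS}}}sheets/{{{SML_NS}}}sheet"
    return [sheet.get("name") for sheet in root.iterfind(path)]


def titles_of_parts(source):
    """Cached titles stored in the TitlesOfParts vector of docProps/app.xml."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/app.xml"))
    path = f"{{{EP_NS}}}TitlesOfParts/{{{VT_NS}}}vector/{{{VT_NS}}}lpstr"
    return [entry.text or "" for entry in root.iterfind(path)]
```

For a workbook that contains only worksheets, both functions would be expected to return the same names. TitlesOfParts may also list other parts, such as named ranges, so the two lists are not guaranteed to be identical for every file.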