Reading Meta Data from PDFs

jrocke · August 9, 2011, 4:16pm

Hello aspose team,

I have been banging my head against a wall trying to code a solution. I have come to try Aspose, and have experimented with your Aspose.Pdf.Kit and Aspose.Pdf libraries, but have come up nearly empty.

I was wondering if your software has solutions for checks such as the following (all when READING PDFs):
all fonts embedded
all font color black
all layers flat
check image extensions
multimedia (i.e. sound) exists

I saw that some of these were asked about in posts around 2008, when they did not exist, and it was mentioned they might be added later, but I have yet to find them in your current libraries.

If functions like these exist, which classes would I work with? There is no need to write code for me; just tell me whether these functions exist and point me in the right direction, because I could not find them in your documentation or through Google.

Thanks.

codewarior · August 10, 2011, 4:11pm

jrocke:

I was wondering if your software has solutions for checks such as the following (all when READING PDFs):
all fonts embedded

Hello Jeremy,

Thanks for your interest in our products. Regarding your requirement, do you need to check whether the contents of the PDF document use a particular font? If so, I am pleased to share that Aspose.Pdf for .NET supports searching for a particular text segment and getting its font name information. For further information, please visit Search and Get Text From All the Pages of PDF Document.

jrocke:

all font color black

You can also get the font color of the text string found inside the PDF document. Please check out the previously shared link for more information.

jrocke:

all layers flat

Can you please share some details regarding this requirement?

jrocke:

check image extensions

For this particular requirement, you first need to extract all the images from the PDF document and then programmatically check the extension of each image file. For more information, please visit Extract Images from the PDF File.

jrocke:

multimedia (i.e. sound) exists

First you need to extract the multimedia attachment from the PDF document, and then you can check its extension to identify whether or not it is a multimedia/sound file.

jrocke · August 10, 2011, 4:20pm

Thanks for the speedy response. Before I try some of this, I will respond to one point.

----------------------------

jrocke:

all layers flat

Can you please share some details regarding this requirement?
-----------------------------

I need to verify that there is only one layer in the PDF (is that enough?).

Other than that, everything looks promising. To work I go.

codewarior · August 10, 2011, 11:52pm

Hello Jeremy,

I am sorry to inform you that the requested feature, getting the number of layers in a particular PDF document, is not yet supported. However, I have logged this requirement as PDFNEWNET-29786 in our issue tracking system under the new features list. We will look further into the details of this requirement and keep you updated on its status. We apologize for the inconvenience.

jrocke · August 11, 2011, 10:30am

It should not be a big problem. For now you have 90% of what we need, which is much better than any other library I could find before.

Thanks for the help.

jrocke · August 11, 2011, 5:18pm

Looks like I have one more question.

I have fiddled around a good deal with the image idea.

Yet whenever I pull the images the way the link suggests, I cannot get anything remotely close to a proper result.

This could be due to my incompetence, or to a misunderstanding of the PDF object itself, but here is what I’m seeing:

code as ->
foreach (Page p in pdfDocument.Pages)
{
    foreach (XImage i in p.Resources.Images)
    {
        string extension = i.Name;
        Console.WriteLine("Image Extension : {0} ", extension);
    }
}

I have tested with about five different PDFs, and the Name is always ‘X’ and Names is always ‘Im0’.

As far as I can tell, there is no way for me to get an extension. Any suggestions?

thanks again

jrocke · August 15, 2011, 2:18pm

bump

any news on the Images?

codewarior · August 16, 2011, 9:40am

Hello Jeremy,

Thanks for your patience.

We are working on this query and will get back to you shortly. We are sorry for the inconvenience.

codewarior · August 17, 2011, 10:59pm

Hello Jeremy,

Thanks for your patience.

I have further investigated the requirement to get the names of images present inside a PDF and, as per our current understanding, I am afraid there is no mechanism to get image names. Please ignore my previous message that caused this confusion.

When an image is added to a PDF, it is converted to a PDF-specific format, and there is no direct match between the original image format and the image format stored in the PDF. The Name property of the XImage is the PDF-internal image name used to refer to the image in the resources collection. For example, Name commonly appears as “Im0”, “Im1”, etc.
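To illustrate what "PDF-specific format" means here: per the PDF specification, an image inside a PDF is stored as a stream whose compression is identified by a filter name, not by a file extension. The sketch below is only an illustration, not an Aspose.Pdf API. `PdfImageFilters` and `Classify` are hypothetical names, and how the filter name is obtained from a given library is left as an assumption. It maps the standard filter names to whether the stored data is lossless.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.Pdf): classifies a standard PDF
// stream filter name by the kind of compression it denotes.
public static class PdfImageFilters
{
    public enum Loss { Lossless, Lossy, Either }

    private static readonly Dictionary<string, Loss> Known =
        new Dictionary<string, Loss>(StringComparer.Ordinal)
        {
            { "FlateDecode", Loss.Lossless },     // zlib/deflate
            { "LZWDecode", Loss.Lossless },
            { "RunLengthDecode", Loss.Lossless },
            { "CCITTFaxDecode", Loss.Lossless },  // bilevel fax encoding
            { "DCTDecode", Loss.Lossy },          // DCT-based JPEG
            { "JPXDecode", Loss.Either },         // JPEG 2000 has both modes
            { "JBIG2Decode", Loss.Either }        // JBIG2 has both modes
        };

    // Accepts the name with or without the leading '/' used in PDF syntax.
    // Returns null for an empty or unrecognised name.
    public static Loss? Classify(string filterName)
    {
        if (string.IsNullOrEmpty(filterName)) return null;
        string key = filterName.TrimStart('/');
        Loss loss;
        return Known.TryGetValue(key, out loss) ? loss : (Loss?)null;
    }
}
```

A stream can carry a chain of filters, so a real check would need to look at each entry. This sketch only classifies a single name.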

Besides this, if you create documents with Aspose.Pdf for .NET and want to discover the image name later, it might be possible to add some meta information when the image is added. So, if you agree, we may investigate the possibility of adding such information. Please note that this approach will only work within our product line, and there is a chance that this custom meta info is lost when the file is edited in third-party editors.

We apologize for the inconvenience.

jrocke · August 18, 2011, 10:32am

So am I correct in understanding that there is no way to verify the extension of an image in a PDF (not made in Aspose), or to see whether it is in a lossless format?
Your post says there is no way to see its name, but the information I really need is its format.

And would this be true for multimedia as well?

Thanks for all your help so far.

codewarior · August 18, 2011, 12:10pm

Hello Jeremy,

As I have shared earlier, if the image is placed inside the paragraphs collection of the PDF document, I am afraid we might not be able to get the name of the image file, which also makes it impossible to get the file's extension.

Besides this, if image or multimedia files are added to the PDF as attachments, we can get information such as Name, Description, MIME type, etc. Please visit the following links for further details: Get All the Attachments from a PDF Document and Get Information of an Attachment.
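Once an attachment's name and MIME type are in hand, deciding whether it is multimedia could look roughly like the sketch below. `AttachmentKind`, `LooksLikeMultimedia` and the extension list are hypothetical and only illustrative, and the two inputs are assumed to come from the attachment information mentioned above. Note that this only covers files stored as attachments; sound embedded in other ways would not be detected.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Aspose.Pdf): guesses whether an attachment
// is audio/video from its MIME type, falling back to its file extension.
public static class AttachmentKind
{
    // Illustrative, non-exhaustive list of common media extensions.
    private static readonly string[] MediaExtensions =
    {
        ".mp3", ".wav", ".aiff", ".ogg", ".mp4", ".mov", ".avi", ".wmv"
    };

    public static bool LooksLikeMultimedia(string name, string mimeType)
    {
        // Prefer the MIME type when the PDF records one.
        if (!string.IsNullOrEmpty(mimeType) &&
            (mimeType.StartsWith("audio/", StringComparison.OrdinalIgnoreCase) ||
             mimeType.StartsWith("video/", StringComparison.OrdinalIgnoreCase)))
        {
            return true;
        }

        // Otherwise fall back to the extension of the attachment name.
        string ext = Path.GetExtension(name ?? string.Empty);
        return MediaExtensions.Contains(ext, StringComparer.OrdinalIgnoreCase);
    }
}
```

For example, `AttachmentKind.LooksLikeMultimedia("intro.WAV", null)` is true through the extension fallback, while a PDF attached with MIME type `application/pdf` is not flagged.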

aspose.notifier · October 15, 2021, 7:45pm

The issues you have found earlier (filed as PDFNET-29786) have been fixed in Aspose.PDF for .NET 21.10.