Extracting original image data

ashmid_a · June 10, 2014, 7:47pm

I am using Aspose.PDF to extract images from a PDF file. The PDF file contains a bunch of images in JPG format (that is, they are encoded in the PDF file with the /Filter /DCTDecode option, after which comes the actual JPG data).

I’ve tried the Save method of the xImage object, and I’ve tried the GetNextImage method of the PdfExtractor object (see below). However, in both cases, the resulting JPG file is somewhat smaller than the original data encoded in the file. It is clear that Aspose is recompressing the data before it saves it as a JPG.

Instead, I’d like to access the actual JPG data, just as it appears within the PDF file. How can I get the actual JPG data for a given xImage object?

Here are the methods that I tried (unsuccessfully):

1]

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(args[1]);

// traverse the individual pages of the PDF file
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++) {
    int imageIndex = 1;
    // traverse each image in the page's resources
    foreach (XImage xImage in pdfDocument.Pages[pageCount].Resources.Images) {
        // include the image index so several images on one page do not collide
        string savefilename = "image-" + pageCount + "-" + imageIndex++ + ".jpg";
        // dispose the stream so the file is flushed and closed
        using (FileStream fs = new FileStream(savefilename, FileMode.CreateNew)) {
            // save output image
            xImage.Save(fs);
        }
    }
}

2]

PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(args[1]);
extractor.ExtractImage();

int i = 1;
while (extractor.HasNextImage()) {
    Console.WriteLine("Getting image number " + i);
    extractor.GetNextImage("image-" + i + ".jpg");
    i++;
}

Note: I also tried specifying ImageFormat.Jpeg, but to no avail; I still receive a smaller, recompressed image. Instead, I’d like to be able to access the actual image data as stored in the PDF file, and then write that out as a file.

codewarior · June 11, 2014, 1:37pm

Hi Avi,

Thanks for using our API.

Please share the resource PDF file and the extracted images, so that we can further test this scenario at our end. We are sorry for this inconvenience.

ashmid_a · June 12, 2014, 3:52am

Here's a directory with a single JPG file, and a PDF file containing that JPG file (which I created with Aspose.PDF):

https://www.dropbox.com/sh/fsxjh58vuiolrxf/AACPIaFu1xUhTs4E4Mu-p5K7a

As you can see if you look at the hex data for "newfile.pdf", the JPG is included within the PDF file as is, byte for byte (from location 0x010F to location 0x6D545). Indeed, this is the great thing about the /DCTDecode filter in PDF files: it allows the PDF file to contain a complete JPG file, without doing any sort of recompression or transcoding.

However, when I run extractor.GetNextImage() to extract the image (as detailed in my previous message), the resulting JPG is significantly smaller, and it is apparent that Aspose.PDF is not providing access to the original JPG data that is within the PDF file, but rather it is reencoding and recompressing it. Instead, I'd like to be able to use Aspose.PDF to extract JPG images from PDF files without any loss of quality. The JPG data is fully there, so it should be accessible without a problem.
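Both observations (the original JPG sits in the PDF byte for byte, and the extracted file differs from it) can be checked independently of Aspose.PDF. The following is a minimal Python sketch, not part of the original post and not using any Aspose API. The function name `compare_extraction` is hypothetical, and so are the file names in the usage comment other than newfile.pdf.

```python
def compare_extraction(pdf_bytes, original_jpg, extracted_jpg):
    """Report whether original_jpg is embedded verbatim in pdf_bytes
    and whether extracted_jpg is byte-identical to it."""
    # bytes.find returns the offset of the first verbatim occurrence, or -1
    start = pdf_bytes.find(original_jpg)
    if start == -1:
        embedded_range = None
    else:
        # inclusive start and end offsets, in hex like a hex editor shows them
        embedded_range = (hex(start), hex(start + len(original_jpg) - 1))
    return {
        "embedded_verbatim": start != -1,
        "embedded_range": embedded_range,
        "extracted_is_identical": extracted_jpg == original_jpg,
        # positive when the extracted file is smaller than the original
        "size_difference": len(original_jpg) - len(extracted_jpg),
    }

# Usage (hypothetical file names apart from newfile.pdf):
#   with open("newfile.pdf", "rb") as f: pdf = f.read()
#   with open("original.jpg", "rb") as f: orig = f.read()
#   with open("image-1.jpg", "rb") as f: extracted = f.read()
#   print(compare_extraction(pdf, orig, extracted))
```

If `embedded_verbatim` is true while `extracted_is_identical` is false, the extraction step has changed the data rather than copying it out of the file.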

How can this be accomplished with Aspose.PDF?
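As a possible workaround outside Aspose.PDF, the raw /DCTDecode stream bytes can be copied straight out of the file. The Python sketch below is only an illustration under stated assumptions, not an Aspose feature: it assumes the PDF is not encrypted, that image objects are stored directly in the file body (not inside compressed object streams), and that /DCTDecode is the only filter applied to the stream. The function name `extract_dct_streams` is hypothetical.

```python
import re
from typing import List

# "stream" keyword followed by an end-of-line; \b keeps "endstream" from matching
STREAM_KEYWORD = re.compile(rb"\bstream\r?\n")
# /Length as a direct integer, or as an indirect reference such as "12 0 R"
LENGTH_ENTRY = re.compile(rb"/Length\s+(\d+)(\s+\d+\s+R)?")

def extract_dct_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the raw bytes of every stream whose dictionary mentions /DCTDecode."""
    images = []
    pos = 0
    while True:
        match = STREAM_KEYWORD.search(pdf_bytes, pos)
        if match is None:
            break
        data_start = match.end()
        # The stream dictionary lies between the preceding "obj" keyword and "stream"
        header_start = pdf_bytes.rfind(b"obj", 0, match.start())
        header = pdf_bytes[max(header_start, 0):match.start()]
        length = LENGTH_ENTRY.search(header)
        if length and not length.group(2):
            # Direct /Length: take exactly that many bytes
            data_end = data_start + int(length.group(1))
            chunk = pdf_bytes[data_start:data_end]
        else:
            # Indirect or missing /Length: fall back to the next "endstream"
            data_end = pdf_bytes.find(b"endstream", data_start)
            if data_end == -1:
                break
            # Drop the end-of-line that precedes "endstream"
            chunk = pdf_bytes[data_start:data_end].rstrip(b"\r\n")
        # Keep only streams that really begin with the JPEG start-of-image marker
        if b"/DCTDecode" in header and chunk.startswith(b"\xff\xd8"):
            images.append(chunk)
        # Continue after this stream so binary data is never scanned for keywords
        pos = data_end
    return images

# Usage (output file names are hypothetical):
#   with open("newfile.pdf", "rb") as f:
#       for n, jpg in enumerate(extract_dct_streams(f.read()), start=1):
#           with open("raw-%d.jpg" % n, "wb") as out:
#               out.write(jpg)
```

Because the bytes are copied without decoding, the output should match the embedded data exactly whenever the assumptions above hold. A real PDF parser is the safer choice for files that use object streams, filter chains, or encryption.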

codewarior · June 13, 2014, 2:02am

Hi Avi,

Thanks for sharing the resource files.

I have tested the scenario and can reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-37075 so that it can be corrected. We will investigate the issue in detail and keep you updated on the status of the fix.

We apologize for the inconvenience.