Extract "raw" images?

georg.mahler · September 11, 2018, 10:25am

Hello Aspose support team,

I have a question regarding the extraction of images.

What I want to do is to extract images from PDFs in their “original” state. I have PDFs that are generated by Nuance. A TIF image is the input; Nuance reads the content with OCR and produces a PDF with one image (the source TIF) and invisible text.

What I need to do now is to extract that TIF in its original state. I tried the examples with and without Facades.

In every case, though, the image seems to be re-rendered in some way. For example, when I use Facades

pdfExtractor.ExtractImage();

a black-and-white TIF with G4 compression is saved in 24-bit format.

Is there a way to extract the “raw image data” as is, without any rendering?

Best regards,
Georg

Farhan.Raza · September 11, 2018, 6:11pm

@georg.mahler

Thank you for contacting support.

You can extract embedded contents from a PDF document in their as-is form by converting the PDF to XML, as explained in Extracting embedded files from a PDF file.

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

georg.mahler · September 12, 2018, 2:37pm

Hey,

thanks for your reply and for pointing me to that example. Unfortunately, “Save as MobiXml” doesn’t do a “raw extraction”.

image.png (23.4 KB)

As you can see in the screenshot, the original PDF has a size of 101 kB. The extracted image is 473 kB, far larger than the whole PDF. The raw image in the PDF should be a TIF image with G4 compression, but the extracted image is a PNG.

Did I do something wrong (I used the example code), or is this just the wrong approach?

using System.IO;
using Aspose.Pdf;

// Output path stub: the source path without its extension
string fileNameStub = Path.Combine(Path.GetDirectoryName(fileName), Path.GetFileNameWithoutExtension(fileName));

// Load source PDF file
Document pdfDocument = new Document(fileName);
// Save output in XML format
pdfDocument.Save(fileNameStub + "_output.xml", SaveFormat.MobiXml);

Best regards,
Georg

Farhan.Raza · September 12, 2018, 7:58pm

@georg.mahler

Thank you for sharing your findings.

We are afraid there may not be any other approach or workaround to extract the images in their raw form. Would you please share a source image and the PDF file containing that image, so that we can log a feature request to investigate and implement this as per your requirements?
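
Before sharing sample files, it may help to confirm what the PDF actually stores, since the thread assumes the embedded image is G4-compressed. The following is only a rough sketch and does not use Aspose.PDF: PdfImageFilterProbe and FindImageFilters are hypothetical names made up for this illustration. It scans the file for stream objects marked /Subtype /Image and prints their /Filter entry. It assumes an unencrypted PDF whose /Filter is written directly in the dictionary rather than as an indirect reference, and a text scan like this can be fooled by unusual files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF: lists the /Filter of each image stream.
static class PdfImageFilterProbe
{
    // Matches "N G obj <dictionary> stream" without crossing an "endobj".
    static readonly Regex StreamObject = new Regex(
        @"\d+\s+\d+\s+obj((?:(?!endobj).)*?)stream\r?\n",
        RegexOptions.Singleline);

    // Matches a /Filter entry that is a single name or an array of names.
    static readonly Regex FilterEntry = new Regex(
        @"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)");

    public static List<string> FindImageFilters(byte[] pdfBytes)
    {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives decoding.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);
        var filters = new List<string>();

        foreach (Match obj in StreamObject.Matches(text))
        {
            string dictionary = obj.Groups[1].Value;
            if (!Regex.IsMatch(dictionary, @"/Subtype\s*/Image\b"))
                continue; // not an image XObject

            Match filter = FilterEntry.Match(dictionary);
            filters.Add(filter.Success ? filter.Groups[1].Value : "(no filter)");
        }
        return filters;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfImageFilterProbe <file.pdf>");
            return;
        }
        foreach (string filter in FindImageFilters(File.ReadAllBytes(args[0])))
            Console.WriteLine(filter);
    }
}
```

If the assumption in this thread holds, the output should include /CCITTFaxDecode, the PDF filter used for Group 3 and Group 4 fax data. Any other filter would suggest the producer re-encoded the TIF when building the PDF. Note also that a /CCITTFaxDecode stream holds only the bare fax-encoded data with no TIFF header, so even a byte-exact extraction would still need a TIFF header written around it to become a .tif file.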