How to identify the image type after extracting from a PDF file

muralicn · August 10, 2020, 3:18am

Hi,

I am trying to parse a PDF document and validate the types of the images it contains (JPEG, PNG, TIFF, BMP, etc.). I was able to extract the images but could not identify the correct image type. Please help.
Please find the code below:

private static void parseAndValidateImage(final byte[] fileContent) {
    // Load the PDF from memory with the Aspose PDF reader
    final Document asposeDocument = new Document(new ByteArrayInputStream(fileContent));
    final PageCollection pagecollection = asposeDocument.getPages();
    int pages = pagecollection.size();
    System.out.println("No of pages: " + pages);
    // Aspose page indexes start at 1
    for (int i = 1; i <= pages; i++) {
        System.out.println("Processing page: " + i);
        final Page page = pagecollection.get_Item(i);
        final Resources resources = page.getResources();
        if (!Objects.isNull(resources)
                && !Objects.isNull(resources.getImages())
                && resources.getImages().size() > 0) {
            XImageCollection imageCollection = resources.getImages();
            int noOfImages = imageCollection.size();
            System.out.println("Page: " + i + " no of Images: " + noOfImages);
            // Only the resource names are available here, not the image type
            for (String name : imageCollection.getNames()) {
                System.out.println("Page number: " + i + " Image name: " + name);
            }
        }
    }
}

public static void main(String[] args) throws IOException {
    System.out.println("START_____");
    final byte[] fileContent =
            org.apache.commons.io.FileUtils.readFileToByteArray(
                    new File("//Users//mcn//Documents//img_cmyk_icc_tiff.pdf"));
    parseAndValidateImage(fileContent);
}

asad.ali · August 10, 2020, 8:30pm

@muralicn

Images in a PDF document are actually stored as streams of pixel data, encoded with a specific filter and parameters. They can, however, be saved in whatever format you require for the purpose of standardization. asposeDocument.Pages[1].Resources.Images[1] returns a pixel map that is saved as a JPEG image by default, and you can also store it in your desired file format. If you have any further inquiries, please feel free to ask.
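
To illustrate the "filter" idea above: the PDF specification defines a small set of standard stream filter names, and the filter on an image stream says how the pixel data is encoded, not which file format the image originally came from. The following is only a minimal sketch that maps those standard filter names to the encoding they imply. The class name PdfImageFilters and the method describe are hypothetical, and how (or whether) the filter name can be read through the Aspose API is not shown in this thread, so the name is simply passed in as a string.

```java
/** Hypothetical helper: maps standard PDF stream filter names to the encoding they imply. */
final class PdfImageFilters {

    /**
     * Describes the encoding implied by a PDF stream filter name,
     * given with or without the leading slash (e.g. "/DCTDecode").
     */
    static String describe(String filterName) {
        if (filterName == null) {
            return "unknown";
        }
        // PDF syntax writes names with a leading slash; accept both forms.
        String name = filterName.startsWith("/") ? filterName.substring(1) : filterName;
        switch (name) {
            case "DCTDecode":
                return "JPEG";
            case "JPXDecode":
                return "JPEG 2000";
            case "CCITTFaxDecode":
                return "CCITT fax";
            case "JBIG2Decode":
                return "JBIG2";
            case "FlateDecode":
            case "LZWDecode":
            case "RunLengthDecode":
                // General-purpose lossless compression applied to raw samples.
                return "raw pixel data (losslessly compressed)";
            default:
                return "unknown";
        }
    }

    public static void main(String[] args) {
        for (String f : new String[] {"/DCTDecode", "/FlateDecode", "/JPXDecode"}) {
            System.out.println(f + " -> " + describe(f));
        }
    }
}
```

Note that PNG, TIFF and BMP do not appear in this mapping: an image that started out in one of those formats is typically embedded as raw samples behind a general-purpose filter, so the original container format is no longer recoverable from the stream.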

muralicn · August 11, 2020, 6:33am

Thanks Asad,
Is it possible to identify the type of the image (asposeDocument.Pages[1].Resources.Images[1]), i.e. whether it is JPEG, TIFF, PNG, BMP, etc.? I need to write a custom validation that identifies the image type.
We are not creating any new files here, just parsing a PDF document.

-Murali

asad.ali · August 11, 2020, 8:49pm

@muralicn

As shared earlier, it is not possible to identify the image type because PDF does not store such information. You can, however, extract the images and save them locally in your required format using the API.
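
If a validation step is still needed on the extracted bytes, one option is to check the leading signature ("magic") bytes of the saved image data. This is only a sketch: the class name ImageSignatures and the method sniff are hypothetical, the bytes are assumed to have been obtained already (how is outside this snippet), and the result reflects the format the image was saved in, not how it was stored inside the PDF.

```java
/** Hypothetical helper: guesses an image file format from its leading signature bytes. */
final class ImageSignatures {

    /** Returns "JPEG", "PNG", "TIFF", "BMP", "GIF" or "unknown". */
    static String sniff(byte[] data) {
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "JPEG";
        }
        if (startsWith(data, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "PNG";
        }
        // TIFF has a little-endian ("II") and a big-endian ("MM") header.
        if (startsWith(data, 'I', 'I', 0x2A, 0x00) || startsWith(data, 'M', 'M', 0x00, 0x2A)) {
            return "TIFF";
        }
        if (startsWith(data, 'B', 'M')) {
            return "BMP";
        }
        if (startsWith(data, 'G', 'I', 'F', '8')) {
            return "GIF";
        }
        return "unknown";
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... signature) {
        if (data == null || data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the check runs on a byte array, it can be applied to data held in memory, so no file has to be written to disk.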