Extract image from pdf along with image extension

Jaibir · August 28, 2024, 2:14pm

How can I extract images from a PDF along with their file extensions, such as png, jpeg, tiff, gif, etc.?
Right now I am using the following code, but the problem is that I am not getting the extension:

// Path to your PDF file
        string pdfFilePath = "input.pdf";

        // Load the PDF document
        Document pdfDocument = new Document(pdfFilePath);

        // Create an ImagePlacementAbsorber object
        ImagePlacementAbsorber imagePlacementAbsorber = new ImagePlacementAbsorber();

        // Iterate through each page in the document
        for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
        {
            // Get the page
            Page page = pdfDocument.Pages[pageNumber];
            
            // Accept the ImagePlacementAbsorber
            page.Accept(imagePlacementAbsorber);

            // Extract images from the page
            foreach (ImagePlacement imagePlacement in imagePlacementAbsorber.ImagePlacements)
            {
                // Create a stream from the image
                using (MemoryStream imageStream = new MemoryStream())
                {                    
                    // Set the output file path
                    string outputFilePath = $"image_page{pageNumber}_{imagePlacement.Index}.jpeg";

                    // Save the image to the desired format
                    SaveImage(imageStream, outputFilePath);
                }
            }
        }

Professionalize.Discourse · August 28, 2024, 2:15pm

@Jaibir

To extract images from a PDF file along with their respective file extensions (like PNG, JPEG, GIF, etc.), you can utilize the ImagePlacementAbsorber class in Aspose.PDF for .NET. However, the code you provided does not currently handle the extraction of the image format.

Here’s an updated version of your code that includes the logic to determine the image format and save the images with the correct file extension:

// Path to your PDF file
string pdfFilePath = "input.pdf";

// Load the PDF document
Document pdfDocument = new Document(pdfFilePath);

// Create an ImagePlacementAbsorber object
ImagePlacementAbsorber imagePlacementAbsorber = new ImagePlacementAbsorber();

// Iterate through each page in the document
for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
{
    // Get the page
    Page page = pdfDocument.Pages[pageNumber];
    
    // Accept the ImagePlacementAbsorber
    page.Accept(imagePlacementAbsorber);

    // Extract images from the page
    foreach (ImagePlacement imagePlacement in imagePlacementAbsorber.ImagePlacements)
    {
        // Create a stream from the image
        using (MemoryStream imageStream = new MemoryStream())
        {
            // Get the image format
            string imageFormat = imagePlacement.Image.ImageFormat.ToString().ToLower();
            string extension = imageFormat == "jpeg" ? "jpg" : imageFormat; // Handle JPEG format

            // Set the output file path with the correct extension
            string outputFilePath = $"image_page{pageNumber}_{imagePlacement.Index}.{extension}";

            // Save the image to the desired format
            SaveImage(imageStream, outputFilePath);
        }
    }
}

Key Changes:

Image Format Detection: The code now retrieves the image format using imagePlacement.Image.ImageFormat and converts it to a string. It also handles the case for JPEG format to ensure the correct file extension is used.
Dynamic File Extension: The output file path is constructed using the detected image format, allowing for the correct extension to be applied when saving the image.

This approach will ensure that the images extracted from the PDF are saved with their appropriate file extensions, making it easier to work with them later.

For more detailed information on extracting images from PDFs, refer to the official Aspose documentation on image extraction [1] [2].

Sources:

[1]: Extract Images From PDF File | Aspose.PDF for .NET API Reference
[2]: Extract Images using PdfExtractor|Aspose.PDF for .NET

Jaibir · August 28, 2024, 2:25pm

I’m getting an error: 'XImage' does not contain a definition for 'ImageFormat' and no accessible extension method 'ImageFormat' accepting a first argument of type 'XImage' could be found.

Jaibir · August 28, 2024, 2:39pm

@Professionalize.Discourse
Is there any solution for this?

asad.ali · August 28, 2024, 8:02pm

@Jaibir

We apologize for the confusion. Please note that the PDF format does not store this information about an image. Images are added to and stored in PDF documents as streams, so we cannot determine their original extension. You can only extract images as streams and save them to a local drive in whichever format you choose.

You can, however, determine whether an image is grayscale or RGB, along with other information about it. Please check the article below on working with images in PDF:

Programming with Images | Aspose API References
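
Since the extension is a choice made at save time rather than something read from the PDF, one option is to derive the file name from the same ImageFormat value that is passed to Save. The following is only a minimal sketch: ImageExtensions and ExtensionFor are hypothetical names and not part of Aspose.PDF, and it assumes System.Drawing is available (the System.Drawing.Common package on newer .NET versions).

```csharp
using System;
using System.Drawing.Imaging;

// Hypothetical helper (not part of Aspose.PDF): maps the ImageFormat chosen
// at save time to a matching file extension.
public static class ImageExtensions
{
    public static string ExtensionFor(ImageFormat format)
    {
        if (format == null) throw new ArgumentNullException(nameof(format));

        // ImageFormat.Equals compares the underlying format GUIDs.
        if (format.Equals(ImageFormat.Jpeg)) return ".jpg";
        if (format.Equals(ImageFormat.Png)) return ".png";
        if (format.Equals(ImageFormat.Gif)) return ".gif";
        if (format.Equals(ImageFormat.Tiff)) return ".tiff";
        if (format.Equals(ImageFormat.Bmp)) return ".bmp";

        // Anything else is left unmapped on purpose in this sketch.
        throw new NotSupportedException("No extension mapped for " + format);
    }
}
```

Inside the extraction loop, the chosen format can then drive both the file name and the Save(Stream, ImageFormat) call, for example by building the path as "image_page" + pageNumber + ImageExtensions.ExtensionFor(format) and passing the same format value to Save. That keeps the extension and the actual encoding in agreement.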

Jaibir · August 29, 2024, 11:13am

Hi @asad.ali
Thank you for the quick response.
I am also facing one more issue while saving the stream with the line of code below.

imagePlacementAbsorber.ImagePlacements[n].Save(maiImgStream, System.Drawing.Imaging.ImageFormat.Jpeg);

Error message is: {"Parameter is not valid."}
Stack Trace:
   at System.Drawing.SafeNativeMethods.Gdip.CheckStatus(Int32 status)
   at System.Drawing.Bitmap.SetResolution(Single xDpi, Single yDpi)
   at Aspose.Pdf.ImagePlacement.Save(Stream stream, ImageFormat format)
   at Helper.AsposePdfHelper.HighlightTextOCRAzurePdf(String docid, String docFolderPath, String pdfFilePath, List`1 sqlTblData, ILogger _logger) in r\AsposePdfHelper.cs:line 194

asad.ali · August 29, 2024, 3:34pm

@Jaibir

If possible, could you please provide the sample PDF that you are processing along with the code snippet that you have used? We will test the scenario in our environment and address it accordingly.

If you are running in a non-Windows environment, please also try using Aspose.Pdf.Drawing instead of Aspose.PDF for .NET, and install the libgdiplus package.
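
In the meantime, it may help to find out which image triggers the exception so that the remaining images can still be saved. The following is only a sketch under assumptions: SafeImageSaver and TrySave are hypothetical names and not part of Aspose.PDF, and the caller supplies the actual save call as a delegate.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of Aspose.PDF): runs a save action against a
// temporary buffer and only writes the file if the action succeeds.
public static class SafeImageSaver
{
    public static bool TrySave(Action<Stream> saveToStream, string outputPath, out string error)
    {
        try
        {
            using (var buffer = new MemoryStream())
            {
                // The caller decides how the image is written to the stream.
                saveToStream(buffer);
                File.WriteAllBytes(outputPath, buffer.ToArray());
            }
            error = null;
            return true;
        }
        // GDI+ failures typically surface as ArgumentException
        // ("Parameter is not valid.") or ExternalException.
        catch (Exception ex) when (ex is ArgumentException || ex is ExternalException)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

With the call from the earlier post, usage could look like SafeImageSaver.TrySave(s => imagePlacementAbsorber.ImagePlacements[n].Save(s, System.Drawing.Imaging.ImageFormat.Jpeg), path, out string error), logging n and error whenever it returns false. This does not fix the underlying problem; it only narrows down which image in the document is affected.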

Jaibir · October 16, 2024, 12:33pm

Hi @asad.ali
Would it be OK to communicate via email, for sharing problems and files?
If yes, please share your email ID.

asad.ali · October 16, 2024, 6:50pm

@Jaibir

We prefer to provide support via our dedicated forum. However, if you are not comfortable sharing confidential files here, you can share them via the private message we just sent you and continue the rest of the discussion in this thread.

asad.ali · November 5, 2024, 8:21pm

@Jaibir

As per the information provided by you in the private message, we were able to replicate the issue in our environment.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58556

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.