Not able to extract complete image from PDF

mallikarjun.sidveer · April 10, 2019, 12:48pm

Hi Team,

I am trying to extract all images from a PDF file, but the extracted images do not match what appears in the PDF.
A single image is being extracted in multiple parts.

Below is the code I am using.

// Requires: using System.IO; using System.Drawing.Imaging; using Aspose.Pdf;
private void ExtractImagesFromPdf(string pdfFilePath)
{
  Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdfFilePath);

  int cnt = 0;
  foreach (Page page in pdfDocument.Pages)
  {
    // Each XImage is a raw image resource of the page, without any overlaid text
    foreach (XImage image in page.Resources.Images)
    {
      // The using block closes the stream even if Save throws
      using (FileStream outputImage = new FileStream(DataDir + "ExtractedImage_" + cnt + ".png", FileMode.Create))
      {
        image.Save(outputImage, ImageFormat.Png);
      }
      cnt++;
    }
  }
}

A sample PDF document is attached for your reference: Main_Figure_Issue.pdf (2.9 MB)

Farhan.Raza · April 10, 2019, 9:03pm

@mallikarjun.sidveer

Thank you for contacting support.

We have extracted the attached images with Aspose.PDF for .NET 19.4. Would you please specify which page numbers and image files have the issue, so that we may investigate further?

ExtractedImages.zip

mallikarjun.sidveer · April 11, 2019, 4:12am

@Farhan.Raza
Thanks for your support.

The images on pages 32 and 33 have the issue. In your attached zip file, images “ExtractedImage_11.png to ExtractedImage_15.png” are not extracted properly. They do not look the same as in the PDF file.

mallikarjun.sidveer · April 11, 2019, 12:07pm

Hi Team… Any update on this??

Farhan.Raza · April 11, 2019, 12:22pm

@mallikarjun.sidveer

Thank you for the details.

The images do not look identical because, in the PDF document, there is some text drawn over the image, like the numbers and JESPR62 on page 34. That text is not part of the image itself. If you copy the image using Adobe Acrobat and paste it into the Paint application, you will get the same image as the one extracted by the Aspose.PDF for .NET API.

CopyImage.PNG

We hope this will clarify any ambiguity. Please feel free to contact us if you need any further assistance.

mallikarjun.sidveer · April 11, 2019, 12:28pm

@Farhan.Raza

Thanks for your support.

Is there any option to crop it as an image, including the text over the image, using Aspose?

Farhan.Raza · April 11, 2019, 12:34pm

@mallikarjun.sidveer

You may iterate through the images one by one and get their position coordinates (LLX, LLY, etc.) as explained in Working with Image Placement. Then use those coordinates to crop and render that region, as explained in Convert a particular page region to Image.

mallikarjun.sidveer · April 12, 2019, 6:34am

Thanks @Farhan.Raza

mallikarjun.sidveer · April 12, 2019, 10:15am

@Farhan.Raza

Can you please share sample code to crop a PDF page to only the image part?
How do I get the exact coordinates of the image part of the page in order to crop it?

Thanks.

Farhan.Raza · April 12, 2019, 8:26pm

@mallikarjun.sidveer

We have devised an approach that combines the two articles suggested above. The basic idea for getting the exact coordinates is in this line:

Rectangle pageRect = new Rectangle(imagePlacement.Rectangle.LLX, imagePlacement.Rectangle.LLY, imagePlacement.Rectangle.LLX + imagePlacement.Rectangle.Width, imagePlacement.Rectangle.LLY + imagePlacement.Rectangle.Height);

However, to avoid duplicate pages we have used the Distinct method, which results in only one image per page even if the page contains several images. You may modify the code as per your requirements; it illustrates the basic idea for extracting images together with their overlay text.

int[] array;
List<int> PageList = new List<int>();
var newDocument = new Document();
// Open document
Aspose.Pdf.Document doc = new Aspose.Pdf.Document(dataDir + "Main_Figure_Issue.pdf");
MemoryStream ms = new MemoryStream();

// Create ImagePlacementAbsorber object to perform image placement search
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();

foreach (Page page in doc.Pages)
{
    // Accept the absorber for all the pages
    page.Accept(abs);

    // Loop through all ImagePlacements, get image and ImagePlacement Properties
    foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
    {
        // Get the image using ImagePlacement object
        XImage image = imagePlacement.Image;

        // Get rectangle of particular page region
        Aspose.Pdf.Rectangle pageRect = new Aspose.Pdf.Rectangle(imagePlacement.Rectangle.LLX, imagePlacement.Rectangle.LLY, imagePlacement.Rectangle.LLX + imagePlacement.Rectangle.Width, imagePlacement.Rectangle.LLY + imagePlacement.Rectangle.Height);
        // Set CropBox value as per rectangle of desired page region
        page.CropBox = pageRect;

        PageList.Add(imagePlacement.Page.Number);
    }
}
array = PageList.ToArray();
var distinct = array.Distinct();
foreach (var number in distinct)
{
    newDocument.Pages.Add(doc.Pages[number]);
}
// Save cropped document into stream
newDocument.Save(ms);

// Open cropped PDF document and convert to image
doc = new Document(ms);
foreach (Page page in doc.Pages)
{
    using (FileStream imageStream = new FileStream(dataDir + "image" + page.Number + "_out" + ".png", FileMode.Create))
    {
        // Create Resolution object
        Resolution resolution = new Resolution(300);
        // Create PNG device with specified attributes
        PngDevice pngDevice = new PngDevice(resolution);
        // Convert a particular page and save the image to stream
        pngDevice.Process(page, imageStream);
    }
}
ms.Close();

We hope this will be helpful. Feel free to contact us in case of any further assistance.

mallikarjun.sidveer · April 15, 2019, 4:31am

@Farhan.Raza

It’s working now. Thanks for your support.

mallikarjun.sidveer · April 16, 2019, 7:34am

@Farhan.Raza

Your solution above works fine when a PDF page contains a single image, but it does not work when a page contains multiple images. Please suggest a solution for this.

Farhan.Raza · April 16, 2019, 7:01pm

@mallikarjun.sidveer

Please note that we have already described this limitation: when a page contains multiple images, only one image is generated for it because of the Distinct method. The code illustrates the basic idea, which can be enhanced or modified as per your requirements.
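
One possible modification for pages with several images is sketched below. It is an untested outline rather than an official sample: instead of collecting distinct page numbers, it crops and renders once per image placement, so each image region gets its own PNG. It uses the same classes as the code earlier in this thread (ImagePlacementAbsorber, CropBox, PngDevice). The class name ImageRegionExporter, the method name ExportImageRegions, the outputDir parameter and the output file naming are hypothetical and only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

static class ImageRegionExporter
{
    // Hypothetical helper: renders every image region of every page to its own PNG.
    public static void ExportImageRegions(string pdfPath, string outputDir)
    {
        Document doc = new Document(pdfPath);
        PngDevice pngDevice = new PngDevice(new Resolution(300));

        foreach (Page page in doc.Pages)
        {
            // A new absorber per page, so only this page's placements are listed
            ImagePlacementAbsorber absorber = new ImagePlacementAbsorber();
            page.Accept(absorber);

            // Copy the rectangles first, before the CropBox is changed
            List<Rectangle> regions = new List<Rectangle>();
            foreach (ImagePlacement placement in absorber.ImagePlacements)
            {
                Rectangle r = placement.Rectangle;
                regions.Add(new Rectangle(r.LLX, r.LLY, r.URX, r.URY));
            }

            Rectangle originalCropBox = page.CropBox;
            int index = 0;
            foreach (Rectangle region in regions)
            {
                // Crop the page to this image region only
                page.CropBox = region;

                using (MemoryStream ms = new MemoryStream())
                {
                    // Save the cropped page and reopen it, as in the code above
                    Document single = new Document();
                    single.Pages.Add(page);
                    single.Save(ms);
                    Document cropped = new Document(ms);

                    string fileName = "page" + page.Number + "_image" + index + "_out.png";
                    using (FileStream imageStream = new FileStream(Path.Combine(outputDir, fileName), FileMode.Create))
                    {
                        // Render the cropped region, including any text drawn over the image
                        pngDevice.Process(cropped.Pages[1], imageStream);
                    }
                }
                index++;
            }

            // Put the page's crop box back to the value it had before cropping
            page.CropBox = originalCropBox;
        }
    }
}
```

A call might look like ImageRegionExporter.ExportImageRegions(dataDir + "Main_Figure_Issue.pdf", dataDir). The save-and-reopen step per region mirrors the earlier code; it costs more time on large documents, but keeps each render independent of the crop box used for the previous image.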