Extract images from PDF document using Aspose.PDF for .NET - extraction fails on the 2nd and subsequent pages

Hi,

Enclosed you will find a PDF created with InDesign. The images on the first page can be extracted from the page’s Resources property, but subsequent pages don’t report any images. If the document is converted to DOCX, the images appear in the Word document as image objects.

Did I find a bug?

Rijksdienst voor Oudheidkundig Bodemonderzoek; Kerkstraat 1, Amersfoort - hires.pdf (8.5 MB)

Thanks for letting me know…

@bartroozendaal,

Thanks for contacting support.

Can you please share your complete environment details along with a working sample project, so that we can investigate further and help you out?

Please find enclosed a console app demonstrating the problem. As you can see, the tool reports 0 images on page 2 and onward, although the PDF does contain images on those pages.

BugAspose.zip (8.4 MB)
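The attached console app is not reproduced in this thread. As a rough illustration only, a per-page check based on the Resources property might look like the sketch below. The class name and the input path are hypothetical, and this is an assumption about the approach described above, not the poster's actual code.

```csharp
using System;
using Aspose.Pdf;

class CountImagesPerPage
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass the real PDF path as the first argument
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";

        using (Document pdf = new Document(inputPath))
        {
            foreach (Page page in pdf.Pages)
            {
                // Images registered directly in this page's resource dictionary
                int count = page.Resources.Images.Count;
                Console.WriteLine("Page " + page.Number + ": " + count + " image(s)");
            }
        }
    }
}
```

A check like this only sees images listed directly in each page's resources, which is the behaviour reported in this thread for page 2 and onward.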

@bartroozendaal,

Thanks for contacting support.

I have looked into your issue and created an investigation ticket with ID PDFNET-48166 in our issue tracking system, so that it can be investigated and resolved as soon as possible.

We’re 14 days on. Is there any update on this issue?

@bartroozendaal,

I regret to inform you that the issue is still unresolved. As per our company policy, first priority for investigation is given to Paid Support, i.e. Enterprise and Priority Support, on a first-come, first-served basis. After that, issues from the normal support forum are scheduled for investigation, also on a first-come, first-served basis. We ask for your patience and will share good news with you soon.

Is there any news? Any thought on when this will be looked at?

I paid $999 for this tool. To others that may not be a lot, but for me this is a big, big deal.

@bartroozendaal

Would you please try the following code snippet to extract images from the PDF document? We tested it with your file and all images were extracted:

// Requires: using System.IO; using Aspose.Pdf;
// dataDir is the folder that contains the PDF (defined elsewhere)
Document pdf = new Document(dataDir + "Rijksdienst voor Oudheidkundig Bodemonderzoek; Kerkstraat 1, Amersfoort - hires.pdf");
int index = 0;
foreach (Page page in pdf.Pages)
{
 // Collect every image placed on this page
 ImagePlacementAbsorber imagePlacementAbsorber = new ImagePlacementAbsorber();
 page.Accept(imagePlacementAbsorber);
 foreach (ImagePlacement imagePlacement in imagePlacementAbsorber.ImagePlacements)
 {
  // Get the image using ImagePlacement object
  XImage image = imagePlacement.Image;
  string outputFileName = dataDir + "img_" + index + ".jpg";
  // FileMode.Create truncates an existing file; OpenOrCreate could leave stale trailing bytes
  using (FileStream fs = new FileStream(outputFileName, FileMode.Create))
  {
   // Second argument is the output resolution
   image.Save(fs, 300);
  }
  index += 1;
 }
}

Thank you for this. This seems to work much better. However, at least one file that I just tried gets into an infinite loop at page 14 of the document. The file is pretty big (450 MB). I will implement this workaround and run the logic over my complete set of files again. Should I find a smaller file that still gives an error, I can send that; otherwise I will get permission from my client to send the big file.

I’ll let you know…

@bartroozendaal

Thanks for your feedback.

You can certainly share your problematic PDF document with us regardless of its size. Our forum supports uploads of up to 10 MB. However, you can upload a larger file to a public file-sharing service, e.g. Dropbox or Google Drive, and share the link with us. We will test the scenario in our environment and address it accordingly.

Hi, I managed to extract all images from the files based on the code you provided. This method seems to be a whole lot slower than what I had before, with many files taking much more time to process.

Nevertheless, it looks like my problem is solved in this case, with an additional dose of patience that is. Thanks for your help with this.

@bartroozendaal

Thanks for sharing your feedback.

It is good to know that things are now working on your side.

An issue has already been logged in our issue tracking system for the previous approach you were using, and we will share updates with you as soon as it is resolved. Please allow us some time.