Filter out duplicates when extracting images

YuriGubanov · April 8, 2022, 9:32pm

Hello! We are evaluating the Aspose.PDF package for data extraction. We are trying to extract images from a large 600-page PDF document with PdfExtractor, using its HasNextImage/GetNextImage loop (Extract Images using PdfExtractor|Aspose.PDF for .NET), and then store the images in a database. The process takes almost 10 minutes and produces about 650 images, including 600 copies of a logo that appears on every page. Unfortunately, this is very poor performance: iTextSharp processes the same document in about a second and returns just 50 distinct images with no duplicates.

With iTextSharp we manually iterate over the pages and filter out already-extracted images by their reference ID, which avoids repeated byte extraction, image decoding, costly IO operations and so on. However, we cannot find any way to identify images with PdfExtractor or XImage.

We are interested in using Aspose.PDF for data extraction, as it has much better support for the various image encodings inside PDF than iTextSharp has. Is there any way or workaround to avoid extracting duplicate images?
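
To illustrate the reference-ID filtering described above, here is a minimal, library-independent sketch. ImageRef, ObjectId, Decode and ReferenceIdFilter are hypothetical names introduced only for this example; they are not part of iTextSharp or Aspose.PDF. The point is only that a repeated reference is skipped before any decoding happens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an image reference found in a page's resources.
// ObjectId models the PDF object number shared by every page that reuses the image.
public sealed record ImageRef(int ObjectId, Func<byte[]> Decode);

public static class ReferenceIdFilter
{
    // Yields decoded bytes once per distinct ObjectId. Repeated references are
    // skipped before Decode is called, so the costly work runs only once per image.
    public static IEnumerable<byte[]> ExtractDistinct(IEnumerable<IEnumerable<ImageRef>> pages)
    {
        var seen = new HashSet<int>();
        foreach (var page in pages)
        {
            foreach (var image in page)
            {
                // HashSet.Add returns false when the ID was already recorded.
                if (seen.Add(image.ObjectId))
                {
                    yield return image.Decode();
                }
            }
        }
    }
}
```

With a logo referenced on every page, Decode would run once for the logo rather than once per page.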

asad.ali · April 9, 2022, 7:45am

@YuriGubanov

Could you please share your sample PDF document for our reference? We will test the scenario in our environment and address it accordingly.

YuriGubanov · April 9, 2022, 10:23am

Here is a link: https://1drv.ms/b/s!AkLThqAden-cgVW5XUGebFT24SYM?e=12pmFa. Note that you must log in to a OneDrive account to download the file (this is a requirement recently introduced by Microsoft).

The file contains a watermark on each page, which PdfExtractor extracts again for every page (the same happens when reading from Document.Pages.Resources.Images). Below is a code snippet from our testing app:

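  using System;
  using System.Drawing.Imaging;   // ImageFormat
  using Aspose.Pdf;               // Document
  using Aspose.Pdf.Facades;       // PdfExtractor

  // SolutionFolder and PdfFileName are defined elsewhere in the testing app.
  Document doc = new Document($@"{SolutionFolder}\{PdfFileName}");
  PdfExtractor extractor = new PdfExtractor(doc);

  // Start image extraction for the whole document.
  extractor.ExtractImage();
  int counter = 0;

  // Save every extracted image as a JPEG file, duplicates included.
  while (extractor.HasNextImage())
  {
    Console.Out.WriteLine(++counter);
    string outputFilePath = $@"{SolutionFolder}\Images\Image{counter}.jpg";
    extractor.GetNextImage(outputFilePath, ImageFormat.Jpeg);
  }

  // Keep the console open until Enter is pressed.
  Console.In.ReadLine();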
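
As a possible stopgap, duplicates could be dropped after extraction by hashing each image's bytes and skipping content that has already been seen. This is only a sketch: ContentHashFilter and IsNew are hypothetical names, and the comment about feeding it from PdfExtractor assumes a GetNextImage overload that writes to a Stream. It does not remove the extraction cost itself; it only avoids storing the same image repeatedly.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical helper: remembers a SHA-256 digest for every image seen so far.
public sealed class ContentHashFilter
{
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Returns true the first time a byte sequence is seen, false for repeats.
    // Assumption: in the extraction loop, the bytes would come from a MemoryStream
    // filled by a stream-based GetNextImage call, then checked here before saving.
    public bool IsNew(byte[] imageBytes)
    {
        using var sha = SHA256.Create();
        string digest = Convert.ToBase64String(sha.ComputeHash(imageBytes));
        return _seen.Add(digest);
    }
}
```

Only images for which IsNew returns true would then be written to disk or to the database.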

YuriGubanov · April 9, 2022, 10:28am

To add: we use Aspose.PDF version 21.8.

asad.ali · April 10, 2022, 7:29pm

@YuriGubanov

We have downloaded the file and are testing the scenario. We will get back to you shortly.

asad.ali · April 11, 2022, 2:47pm

@YuriGubanov

We were able to replicate a similar issue in our environment while testing the scenario with Aspose.PDF for .NET 22.3. Therefore, an issue has been logged in our issue tracking system as PDFNET-51622. We will look further into its details and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

YuriGubanov · June 22, 2022, 2:59pm

Hello!

Is there any update on the issue?

asad.ali · June 22, 2022, 6:30pm

@YuriGubanov

We regret to inform you that the earlier logged ticket has not yet been resolved. It will be investigated and resolved on a first-come, first-served basis, and we will inform you once we make definite progress towards its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

YuriGubanov · July 28, 2022, 9:44am

Hello,

Is there any update on the issue, or at least a time frame for the fix? We are using a competing solution because yours is 10 times slower due to this issue.

asad.ali · July 28, 2022, 5:21pm

@YuriGubanov

We are afraid that the earlier logged ticket has not been resolved yet because of other pending issues in the queue. However, we have recorded your concerns and will definitely consider them during the investigation. We will inform you as soon as we have an update about the fix.

We apologize for the inconvenience.