Feature Request for Image Extraction from PDF

Looking through this forum, there appear to have been many comments, and some considerable frustration, around the performance of image extraction from PDF. Image extraction is an extremely common requirement in the scanning world, where the PDF format is regularly used as little more than a wrapper around a full-page scanned image (usually in TIF or JPEG format).

In such cases, most people simply wish to get back the original image in its native format. This should be a very fast operation, since it only involves returning the image stream in the PDF as-is with no additional processing.

However, as other posts and our own testing confirm, this is still a very slow operation in Aspose.PDF: tens, possibly hundreds, of times slower than in other PDF toolkits on the market. The problem, I assume, is that Aspose is rendering the image and then re-encoding it to the desired format based on the Device selected.

Could I therefore request the following, which I believe would keep us and many other users satisfied in this area:

1) return additional information about an embedded image (specifically number of pixels high/wide, native format and bit depth), e.g. by extending the PDF.ImageInfo class.

2) provide a method that allows an image stream to be extracted as-is without any internal rendering, e.g. a Byte[] PDF.GetImageStream(page, imageindex) method.

More generally, while the use of pure .NET within Aspose.PDF is advantageous in many ways, it does seem to prevent Aspose from being competitive in performance terms with other toolkits, most of which use native C/C++ code for core operations. An increase in the use of unsafe code methods to boost performance in CPU-intensive operations such as image and text extraction would be very welcome.

GeorgeH:
However, as other posts and our own testing confirm, this is still a very slow operation in Aspose.PDF: tens, possibly hundreds, of times slower than in other PDF toolkits on the market. The problem, I assume, is that Aspose is rendering the image and then re-encoding it to the desired format based on the Device selected.
Hi George,

Thanks for contacting support.

The time taken by our product to extract an image from a PDF file depends upon the size/quality of the image already present in the PDF file and the complexity and structure of the document itself. However, if you are facing any performance issue while performing any operation, please share some sample PDF files so that we can test the scenario at our end.

GeorgeH:
1) return additional information about an embedded image (specifically number of pixels high/wide, native format and bit depth), e.g. by extending the PDF.ImageInfo class.
When using the XImage class to extract images from a PDF file, you can use the Height and Width properties of this class to get the height and width of the image.

GeorgeH:
2) provide a method that allows an image stream to be extracted as-is without any internal rendering, e.g. a Byte[] PDF.GetImageStream(page, imageindex) method.
Aspose.Pdf for .NET also offers the feature to extract an image from a PDF file and save the resultant image to a stream object, or even save it to the file system. However, the product performs internal operations during image extraction. For further information, please visit Extract Images from the PDF File

We are sorry for your inconvenience.

codewarior:
The time taken by our product to extract an image from a PDF file depends upon the size/quality of the image already present in the PDF file and the complexity and structure of the document itself.


Quite possibly, but my point is that it is slow regardless, because it is doing more than it needs to (it is decoding the image when many of us would just like to get hold of it in its raw form). The attached is a very simple case - a bitonal TIF embedded in a PDF. With Aspose it takes about 1.3s to convert this to TIF on my system - with PDFLib (for example) it takes about 100ms.

codewarior:
When using the XImage class to extract images from a PDF file, you can use the Height and Width properties of this class to get the height and width of the image.


Yes, sorry, I forgot that height and width are available, but native format (TIF, JPEG etc.) and bit depth are not.
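
Both pieces of information are already recorded in the PDF itself: an image XObject's dictionary carries a /Filter entry naming the compression and a /BitsPerComponent entry giving the bit depth. A minimal Python sketch of the mapping follows. The function name describe_image and the plain-dict input are assumptions for illustration only; this is not Aspose.PDF code or the API of any other toolkit.

```python
# Standard PDF image filters and the native format each one carries.
NATIVE_FORMATS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT fax (TIFF-compatible)",
    "JBIG2Decode": "JBIG2",
}


def describe_image(xobject):
    """Summarise an image XObject dictionary given as a plain dict (hypothetical input)."""
    filters = xobject.get("Filter", [])
    if isinstance(filters, str):
        filters = [filters]  # /Filter may be a single name or an array of names
    names = [name.lstrip("/") for name in filters]
    # Filters are applied in array order when decoding, so the last one is the image codec.
    native = NATIVE_FORMATS.get(names[-1], "raw samples") if names else "raw samples"
    # Image masks are always 1 bit per sample and may omit /BitsPerComponent.
    bits = 1 if xobject.get("ImageMask") else xobject.get("BitsPerComponent")
    return {
        "width": xobject.get("Width"),
        "height": xobject.get("Height"),
        "native_format": native,
        "bits_per_component": bits,
    }
```

For a typical fax-scanned page, the dictionary would name /CCITTFaxDecode with 1 bit per component, which is exactly the "native format and bit depth" being asked for.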

codewarior:
Aspose.Pdf for .NET also offers the feature to extract an image from a PDF file and save the resultant image to a stream object, or even save it to the file system. However, the product performs internal operations during image extraction. For further information, please visit Extract Images from the PDF File


Yes, to be clear, by 'Stream' I'm referring to the stream of binary data within the PDF file, not .NET streams. The 'internal operations' are exactly the problem, and that's my request - that you provide an alternative way to access the image data without the expensive processing. I understand that there are complex cases, but in most instances, converting a CCITTFax encoded stream to a valid TIF only requires slapping a few TIF header fields on the front, as detailed here: http://blog.idrsolutions.com/2011/08/ccitt-encoding-in-pdf-files-converting-pdf-ccitt-data-into-a-tiff/ . In many cases, JPEG is even simpler because an entire valid JPEG file, headers and all, is contained within the stream.
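
To make the point concrete, here is a minimal Python sketch of that header-prefixing step. It assumes a single strip, little-endian byte order, and that the width, height and CCITT group have already been read from the PDF image dictionary. The function names and the photometric default are illustrative assumptions; this is not Aspose or PDFLib code.

```python
import struct


def wrap_ccitt_as_tiff(data, width, height, compression=4, photometric=0):
    """Prefix raw CCITT fax data with a minimal single-strip TIFF header.

    compression: TIFF Compression tag value, 4 = CCITT Group 4, 3 = CCITT Group 3.
    photometric: 0 = WhiteIsZero, 1 = BlackIsZero (assumption: flip it if the
                 resulting image appears inverted).
    """
    num_entries = 8
    # 8-byte file header + 2-byte entry count + 12 bytes per entry + 4-byte next-IFD offset
    header_size = 8 + 2 + 12 * num_entries + 4
    entries = [                      # (tag, field type, count, value), ascending tag order
        (256, 4, 1, width),          # ImageWidth (LONG)
        (257, 4, 1, height),         # ImageLength (LONG)
        (258, 3, 1, 1),              # BitsPerSample (SHORT): bitonal
        (259, 3, 1, compression),    # Compression (SHORT)
        (262, 3, 1, photometric),    # PhotometricInterpretation (SHORT)
        (273, 4, 1, header_size),    # StripOffsets: image data follows the header
        (278, 4, 1, height),         # RowsPerStrip: the whole image is one strip
        (279, 4, 1, len(data)),      # StripByteCounts
    ]
    out = struct.pack("<2sHI", b"II", 42, 8)  # byte order, magic number, offset of first IFD
    out += struct.pack("<H", num_entries)
    for tag, field_type, count, value in entries:
        out += struct.pack("<HHII", tag, field_type, count, value)
    out += struct.pack("<I", 0)               # no further IFDs
    return out + data


def looks_like_jpeg(data):
    """A DCTDecode stream normally begins with the JPEG start-of-image marker FF D8."""
    return data[:2] == b"\xff\xd8"
```

The header here is a fixed 110 bytes and no pixel data is decoded, which is why this kind of extraction can be so much cheaper than render-and-re-encode.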

Hi there,

Thanks for your feedback. After an initial investigation, I've logged the requested enhancements/features in our bug tracking system for further investigation and resolution; details of the logged issues are listed below. I have linked your request to these issues as well. You will be notified via this forum thread as soon as they are resolved.


  • PDFNEWNET-35063: Image extraction performance enhancement
  • PDFNEWNET-35064: Native image format and bit depth properties
  • PDFNEWNET-35065: Stream providing raw bytes of image in Pdf file.

Please feel free to contact us for any further assistance.

Best Regards,

Any news on these? These issues remain a serious problem for us.

George

Hi George,


Thanks for your inquiry. I’m afraid the reported issues are still not resolved. However, I’ve shared your concern with the development team and asked them to provide an ETA at the earliest opportunity. We will update you as soon as we receive feedback.

Thanks for your patience and cooperation.

Best Regards,

Hello,

We are currently evaluating these controls, which we need to use in a document management system to extract individual pages in PDFs, splitting them based on pages containing barcodes. We also noticed the system was a bit slow - it took about 1 minute to extract 8 barcodes from an 8-page PDF document. Is there a more efficient way of performing the barcode detection than what you supplied in the sample code?

Also, is there any news on the feature requests:

  • PDFNEWNET-35063: Image extraction performance enhancement
  • PDFNEWNET-35064: Native image format and bit depth properties
  • PDFNEWNET-35065: Stream providing raw bytes of image in Pdf file.

Regards,


Martin J Dye


Hi Martin,

Thanks for your inquiry.
MartinDye:

We are currently evaluating these controls, which we need to use in a document management system to extract individual pages in PDFs, splitting them based on pages containing barcodes. We also noticed the system was a bit slow - it took about 1 minute to extract 8 barcodes from an 8-page PDF document. Is there a more efficient way of performing the barcode detection than what you supplied in the sample code?

We are sorry for the inconvenience caused. Could you please share your sample code and document here, so that we can investigate the issue and provide you with more information?

MartinDye:
Also, is there any news on the feature requests:

  • PDFNEWNET-35063: Image extraction performance enhancement
  • PDFNEWNET-35064: Native image format and bit depth properties
  • PDFNEWNET-35065: Stream providing raw bytes of image in Pdf file.

I'm afraid the requested features are still not implemented due to other priority tasks. We will update you as soon as we make some progress towards implementation of these features.

Thanks for your patience and cooperation.


Best Regards,

Hello,

Thanks for your response. I’ve enclosed 2 files: PDFProcessTest.zip, used for the testing, and Barcodes Test 03.zip, which is the PDF from a scanner that we used to perform the test.

We used the method “Test1” which can be called by running the software and clicking the “Read Images from PDF using PDF extractor” button. You will need to type in the correct paths.

Thanks,

Martin.

Hi Martin,


Thanks for your inquiry. We’ve tested the scenario and noticed a performance issue in image extraction, so we have logged an investigation issue as PDFNEWNET-36008 in our issue tracking system for further investigation and resolution. We will keep you updated about its progress via this forum thread.

We are sorry for the inconvenience caused.

Best Regards,

Hi,


We are also looking to use the Aspose components in a project, but at present these performance issues are a concern that is preventing us from proceeding.

Can you please ask for a clear timeline for them to be resolved so that we can either plan for this date, or look at alternative options?

Thanks

Ben

Hi Ben,


Thanks for your inquiry. I am afraid the above performance enhancements are a bit complicated. Our development team is working to improve image extraction performance, but due to the complexity and other high-priority tasks we unfortunately can’t provide an ETA. We will keep you updated on the progress of these issues.


We are sorry for the inconvenience caused.


Best Regards,

The issues you have found earlier (filed as PDFNEWNET-36008) have been fixed in Aspose.Pdf for .NET 8.9.0.



Any updates on the PDFNEWNET items listed above? I looked through all the release notes from the last few years and didn’t see them (or missed them).

I’ve played with another PDF component that has much better performance at extracting images. I’d rather use Aspose because I’ve used it for years, so I’ve built up a lot of code to handle special scenarios based on PDFs that are “unique”. But the performance difference is a little too large to ignore.

Many times I just need to extract the TIFF inside the PDF that a fax machine generated.

@mike.doerfler,
Unfortunately, there is no update available on the linked ticket IDs PDFNET-35063, PDFNET-35064 and PDFNET-35065. Ticket PDFNET-35063 is on the high-priority list under the free support model. Tickets PDFNET-35064 and PDFNET-35065 depend on an internal feature and have been postponed to a later date. We have logged an ETA request under all three ticket IDs and will let you know once significant progress has been made.

Are the PDFNET tickets linked above anything you guys intend to work on?

@mike.doerfler

Thanks for contacting support.

Please note that we resolve every logged issue, feature, and enhancement request; however, under the free support model they are resolved on a first-come, first-served basis. The timeline for resolving a ticket depends on how long the queue of previously logged issues is under the normal/free support model.

The issues logged earlier and linked to this thread are not yet resolved because of the large number of pending issues in the queue, and they have been postponed to future releases of the API. As soon as we have definite updates regarding their resolution, we will let you know.

You may also check our priority/paid support option if these tickets are urgent and you need them resolved on a priority basis.

We are sorry for the inconvenience.

The issues you have found earlier (filed as PDFNET-35063) have been fixed in Aspose.PDF for .NET 24.3.