Stripping repeated images out of multiple similar pdfs

Hi There,

I have a client with a database full of PDFs, which currently take up a considerable amount of space. I am looking for a way to bring that space usage down. I have tried compression, but it only reduces the size by about 18%.

All of the PDFs are pretty similar and belong to one of three or four major variants, with a few more minor variations within those, but they can have a different number of tables of data, columns, etc. As such, I can't really use a template mechanism, store the data in the database and re-merge it with the template at a later stage. One possible strategy I have in mind is to strip out all the common images (of which there are quite a few in each PDF) and store these separately, once. To achieve this I would need to be able to extract just the structure of the PDF (possibly to XML), separate out and tag all the images, and then be able to reassemble it again at a later stage.
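The "store these separately, once" idea can be sketched independently of any PDF library. The following is a minimal sketch only, assuming the image payloads have already been extracted as raw bytes by some other means; `ImageStore`, `strip_images`, `restore_images` and the sample payloads are hypothetical names invented for illustration and are not part of Aspose.Pdf or Aspose.Pdf.Kit. Each image is keyed by its SHA-256 digest, so an image shared by many documents is kept only once and each document keeps just a list of tags.

```python
import hashlib


class ImageStore:
    """Keeps each distinct image payload once, keyed by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Identical bytes always hash to the same key, so repeats are not stored again.
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


def strip_images(images, store):
    """Replace a document's image payloads with tags pointing into the store."""
    return [store.put(data) for data in images]


def restore_images(tags, store):
    """Look the original payloads up again from their tags."""
    return [store.get(tag) for tag in tags]


# Hypothetical payloads standing in for images extracted from two similar PDFs.
logo = b"shared-logo-bytes"
doc_a = [logo, b"chart-A-bytes"]
doc_b = [logo, b"chart-B-bytes"]

store = ImageStore()
tags_a = strip_images(doc_a, store)
tags_b = strip_images(doc_b, store)
```

Here the shared logo is stored once even though both documents reference it, and `restore_images(tags_a, store)` returns the original payloads for the first document. Extracting the images from a PDF and putting them back is the part that depends on the library and is not shown.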

Speed is not essential in this process, as the files are very rarely, if ever, looked at after a couple of months, and I would only do this to older files.

Does this sound achievable to anyone here using Aspose.Pdf and Aspose.Pdf.Kit (which we are already licensed for)? If so, what are the steps I should take? Are there going to be any major problems with this approach?

Hi Kyle,

First of all, please let me share how I understood your requirement:

To reduce the size of the PDF files, you would like to remove all of the images from these files and save those images separately with a tag (say, an index number). At some later stage you would add those images back to their respective PDF files at their original locations and at their original size (the height and width they had in the original PDF file).
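To make that restatement concrete, here is a minimal sketch of the bookkeeping it implies: one record per removed image holding its tag, page, position and original size, so the image can later be put back where it was. It assumes nothing about Aspose.Pdf.Kit; `ImagePlacement`, `save_manifest`, `load_manifest`, the field names and the coordinate values are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ImagePlacement:
    """Where a removed image belongs in the original PDF (hypothetical fields)."""
    page: int
    x: float
    y: float
    width: float
    height: float
    image_tag: str  # tag or index of the separately stored image


def save_manifest(placements):
    """Serialize the placements so they can be stored next to the stripped PDF."""
    return json.dumps([asdict(p) for p in placements], indent=2)


def load_manifest(text):
    """Rebuild the placement records from the stored JSON."""
    return [ImagePlacement(**entry) for entry in json.loads(text)]


# Made-up example values for two images on the first page.
placements = [
    ImagePlacement(page=1, x=36.0, y=720.0, width=120.0, height=40.0, image_tag="logo"),
    ImagePlacement(page=1, x=36.0, y=40.0, width=540.0, height=24.0, image_tag="footer"),
]
manifest_text = save_manifest(placements)
round_tripped = load_manifest(manifest_text)
```

The manifest round-trips through JSON unchanged, so it could be kept alongside the stripped file and read back when the images need to be restored.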

Is that correct, or would you like to add or remove anything from the above statement? Could you also please share a sample PDF file that you’re working with?

Please comment on this so that we can help you out.
Regards,

Hi Shahzad,

That is correct.

I was hoping that I could use the Aspose.Pdf XML format, get the images out of that and replace each one with a tag, but I can't find a way to save a PDF file to the XML format.

Any assistance would be greatly appreciated.

My client probably wouldn’t appreciate me posting the pdf in a public place and it would be quite hard to produce one full of dummy data, so I have sent the pdf in an email to you.

Regards,
Kyle

Hi Kyle,

Thank you very much for sharing the PDF file and your comments.

Although this feature is currently not supported by Aspose.Pdf.Kit, I have logged a new feature request as PDFKITNET-14887 in our issue tracking system. Our team will look into this requirement, and you’ll be updated via this forum thread once it is supported.

We’re sorry for the inconvenience.
Regards,

Hi Shahzad,

That's great, thanks! I don't mean to be pushy, but if the feature does get approved, do you have any rough idea how long it would take before it is available?

Regards,
Kyle

Hi Kyle,

I’m sorry, I can’t share an ETA at the moment. Our team still needs to investigate the issue in detail to find out the time and effort required for this feature. You’ll be updated once we have a clear picture.

We’re sorry for the inconvenience.
Regards,

The issues you have found earlier (filed as 14887) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.
