Aspose Pdf uses two ways to optimize PDF file size

rnara · August 4, 2020, 3:11am

Hi @asad.ali,
Please update us with the required information. This is in relation to an escalated issue which was reported in the enterprise support. Please provide us the info asap, else all the hard work done by the enterprise support will be undone.

Regards,
Ankur Vashishtha

asad.ali · August 4, 2020, 7:31pm

@sumeetm

We are currently in the process of publishing a new release of the API, and we intend to publish the required information once that task is done. However, could you please share the ID of the enterprise support ticket related to this inquiry? You may also post your inquiry under the other ticket you opened in the helpdesk (paid support forum); this way the priority of the investigation will be escalated. We greatly appreciate your patience in this matter.

rnara · August 5, 2020, 4:10am

Hi @asad.ali,
How is it possible to release a jar without its documentation?
We have been checking with Boris consistently on this issue, as it has been very highly escalated by the customer. Kindly understand its importance.
Adding the mail reply sent by Boris here:
Hi Mahesh, Kartik,

I am glad to inform you that this issue is fixed in new Aspose.PDF for Java 20.7.

Please use the code snippet like this:

Document doc = new Document("Office.pdf");
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("logFile.Txt", PdfFormat.PDF_A_1B, ConvertErrorAction.Delete);
opts.setConvertSoftMaskAction(ConvertSoftMaskAction.ConvertToStencilMask);
doc.convert(opts);

OptimizationOptions optimizationOptions = new OptimizationOptions();
optimizationOptions.setSubsetFonts(true);
optimizationOptions.setRemoveUnusedStreams(true);
optimizationOptions.setRemoveUnusedObjects(true);
doc.optimizeResources(optimizationOptions);

doc.save("AsposePDF.pdf");

Thanks,
Boris Pazin

We can’t use the code without the proper information.

Thanks,
Avinash.

asad.ali · August 5, 2020, 5:08pm

@sumeetm

We publish all the information related to the API that our customers are commonly interested in, both in the public documentation and in the API References. However, you have asked for in-depth technical details of different methods, and we need to gather the details of the logic implemented in the API in order to answer you properly.

We have noted your concerns and updated the referenced ticket accordingly as well. We will soon get back to you with required information. Please give us some time.

rnara · August 11, 2020, 4:22am

Hi @asad.ali,
Can you please provide us an update on this? We are in a difficult situation here.

Thanks,
Avinash.

asad.ali · August 11, 2020, 6:24pm

@sumeetm

Please check the following details about the queries you have asked:

OptimizeResources() Method

The Document.OptimizeResources() method decreases document size. It can do so in several ways, and OptimizationOptions controls which of them are applied. Each option is explained below.

RemoveUnusedObjects: A PDF document consists of PDF objects. Every object has its number (ID) and belongs to one of the following types: name, string, number, null (scalar values); dictionary, array (which form the PDF document structure); or stream (raw binary data). Objects may be referenced from other objects; for example, a dictionary or array may contain references to other objects. These references unite all parts of the PDF document and form its structure. Stream objects contain binary data, and this data may be large. For example, images and fonts are stored as stream objects. After some manipulations with the document, some streams may become “orphaned”, i.e. nothing references them any longer. For example, an old image was replaced with a new one, but the old instance of the image was not removed. In other words, the stream no longer belongs to the document logically but is still contained in it physically. RemoveUnusedObjects finds orphaned objects in the document and removes them; this can help to decrease the document size if such objects are found.
RemoveUnusedStreams: Every document page has its Resources dictionary, which contains data such as images and fonts used in the page contents. Resources are referenced by name; for example, the page contents may contain an operator that draws the image named “Image12” at a particular place on the page. In some cases a resource may become unused: for example, an image was removed from the page contents but left in the page resources, or a page was extracted from a document but its resources still contain the common resources of that document. The resource becomes “orphaned”, but please note that this situation differs from the one described for RemoveUnusedObjects: the object is still referenced from the resources dictionary of the page, yet the resource is never used by the page (its name never appears in the page contents). RemoveUnusedStreams finds and removes these unnecessary resources. Since the removed resource stream objects are then no longer linked to the document structure, the RemoveUnusedObjects option is automatically activated when RemoveUnusedStreams is used.
LinkDuplicateStreams: A document may contain several copies of a stream with the same contents. For example, this may occur when two or more identical documents are merged: every copy of the same page has its own resources dictionary with separate image, font, and other resource objects inside. LinkDuplicateStreams finds stream objects with equal contents and merges them into one object, replacing references to the objects accordingly. This decreases the document size because duplicated information is removed.
SubsetFonts: Every font used to display text on the page contains a set of glyphs for the font's characters. The PDF specification supports a “font subset”, i.e. a font containing only those glyphs which are actually used. This may cause issues when the text has to be updated later (since the required glyphs may be absent from the font), but for a document which is not going to change it decreases the size.
UnembedFonts: Fonts may be embedded into the PDF document, i.e. all font data is contained in the font resource, or not embedded, in which case the required font is loaded from the fonts installed on the computer. Unembedding fonts may help to decrease the size but may cause issues when the document is displayed on a computer where the required fonts are not installed.
AllowReusePageContents: The page content is a set of operators describing the page appearance. It is stored in a stream object. If the document contains identical pages, their contents can be merged, i.e. different pages share one stream object which holds their contents. This may decrease the document size if the page content is large. The disadvantage is that when one of the pages is changed, all its copies are updated accordingly (since they use the same object).
RemovePrivateInfo: A page may contain private information for a conforming reader. This entry may hold information of any type, and sometimes it is large. RemovePrivateInfo removes this information.
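To illustrate what “orphaned” means in the RemoveUnusedObjects description, here is a minimal sketch of a reachability check over a toy object graph. It is not Aspose code: the class name OrphanSweepSketch, the map-of-references model, and the object numbers are all assumptions made for the example. The idea is simply that every object that cannot be reached from the document root by following references is a candidate for removal.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (hypothetical): object IDs mapped to the IDs they reference. */
class OrphanSweepSketch {

    /** Returns the IDs reachable from the root by following references. */
    static Set<Integer> reachable(Map<Integer, List<Integer>> refs, int rootId) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(rootId);
        while (!pending.isEmpty()) {
            int id = pending.pop();
            if (!seen.add(id)) {
                continue; // already visited; also guards against reference cycles
            }
            for (int target : refs.getOrDefault(id, List.<Integer>of())) {
                pending.push(target);
            }
        }
        return seen;
    }

    /** Returns the IDs present in the file but not reachable from the root. */
    static Set<Integer> orphans(Map<Integer, List<Integer>> refs, int rootId) {
        Set<Integer> result = new TreeSet<>(refs.keySet());
        result.removeAll(reachable(refs, rootId));
        return result;
    }

    public static void main(String[] args) {
        // 1 = catalog, 2 = page, 3 = page resources,
        // 4 = new image, 5 = old image that was replaced but never deleted
        Map<Integer, List<Integer>> refs = Map.of(
                1, List.of(2),
                2, List.of(3),
                3, List.of(4),
                4, List.<Integer>of(),
                5, List.<Integer>of());
        System.out.println("Orphaned objects: " + orphans(refs, 1)); // prints "Orphaned objects: [5]"
    }
}
```

In this model the RemoveUnusedStreams case looks different: if the resources object 3 still listed object 5, it would be reachable and the sweep above would keep it. That matches the explanation that the unused resource entry has to be dropped first, after which the object becomes orphaned and the sweep can remove it.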
ImageCompressionOptions: File size may also be reduced by optimizing images, but image recompression or resizing may cause loss of image quality.
OptimizationOptions contains a set of options for image compression (ImageCompressionOptions).
CompressImages: This flag determines whether image compression is allowed. If it is false (the default), no changes are made to the images.
ResizeImages: This flag determines whether changing the image dimensions is allowed. False by default.
ImageQuality: The required quality of the image (in percent). Applicable when CompressImages is true. Images are recompressed using the JPEG algorithm at the given quality.
MaxResolution: The maximum desired resolution of the images (in DPI). Images whose resolution is higher than the specified maximum are reduced in dimension to match it. The image resolution is calculated from the size of the image on the page in user units and the physical image dimensions in pixels.
ImageEncoding: The image encoding which the optimizer will try to use when recompressing images. In some cases (for example, for some monochrome images) Flate compression may give a smaller size than JPEG compression. Please note that specifying ImageEncoding does not mean that all images will be recompressed with this algorithm. The image optimizer tries the specified format, and if compressing in this format does not decrease the image size, recompression is not done.
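The resolution calculation described for MaxResolution can be worked through with a little arithmetic. The sketch below is an illustration, not the library's implementation: the class and method names are hypothetical, it assumes the PDF default of 72 user units per inch, and the rounding of the reduced dimension is also an assumption.

```java
/** Illustrative arithmetic only (hypothetical names); not the library's code. */
class ImageResolutionSketch {

    /** Assumption: PDF default user space, where one user unit is 1/72 inch. */
    static final double USER_UNITS_PER_INCH = 72.0;

    /** Effective resolution (DPI) of an image drawn at the given size on the page. */
    static double effectiveDpi(int pixels, double sizeInUserUnits) {
        double inches = sizeInUserUnits / USER_UNITS_PER_INCH;
        return pixels / inches;
    }

    /** Pixel dimension after capping the resolution at maxDpi; never upscales. */
    static int limitedPixels(int pixels, double sizeInUserUnits, double maxDpi) {
        double dpi = effectiveDpi(pixels, sizeInUserUnits);
        if (dpi <= maxDpi) {
            return pixels; // already at or below the cap, leave unchanged
        }
        return (int) Math.round(pixels * maxDpi / dpi);
    }

    public static void main(String[] args) {
        // A 3000 px wide image drawn 360 user units (5 inches) wide
        double dpi = effectiveDpi(3000, 360.0);
        int reduced = limitedPixels(3000, 360.0, 150.0);
        System.out.println(dpi + " DPI -> " + reduced + " px"); // prints "600.0 DPI -> 750 px"
    }
}
```

Note that the same image drawn larger on the page has a lower effective resolution, so whether an image is resized depends on how it is placed, not only on its pixel dimensions.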

Optimize() method

This method performs what is also known as linearization. Calling Document.Optimize() is identical to setting Document.IsLinearized to true.

Linearization is the process of optimizing a PDF document for use on the Web. Its purpose is to display the first pages of the PDF document as soon as possible when the document is loaded over a slow connection. To achieve this, document objects are reordered so that the significant document structures and the structures of the first pages are placed at the beginning of the document. Please note that linearization does not decrease the size of the document. The process is called optimization only in the sense of optimizing the document for fast loading over the Web.

PDF/A conversion
PdfFormatConversionOptions

OptimizeFileSize
If this option is set to true, additional actions to decrease document size are performed during PDF/A conversion, but this may take additional time. For now, the only action of OptimizeFileSize is applying font subsets to the fonts used in the document, i.e. it is the same as OptimizationOptions.SubsetFonts. Later, we are planning to introduce other methods of decreasing file size during PDF/A conversion.
ConvertSoftMaskAction
Defines how images with a soft mask are handled. If this property is set to ConvertToStencilMask, an image with a soft mask is converted into an image with a stencil mask. Otherwise, the part of the page covered by the image is converted into JPG, and the resulting image is drawn on the page instead of the original image. Converting to a stencil mask decreases the size of the image but may cause loss of image quality in some cases.
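Why a stencil mask is smaller but can lose quality is easy to see in a toy example. In PDF, a soft mask carries multi-level transparency (commonly 8 bits per sample), while a stencil mask carries 1 bit per sample. The sketch below is only an illustration of that reduction, not what Aspose.PDF does internally: the class name, the fixed threshold of 128, the bit polarity, and the absence of row padding are all assumptions.

```java
/** Illustration only (hypothetical): reduces 8-bit mask samples to 1-bit samples. */
class StencilMaskSketch {

    /**
     * Packs 8-bit alpha samples into 1 bit per sample, most significant bit first.
     * A sample becomes 1 when it is at or above the threshold (assumed polarity).
     */
    static byte[] toStencil(byte[] softMask, int threshold) {
        byte[] packed = new byte[(softMask.length + 7) / 8];
        for (int i = 0; i < softMask.length; i++) {
            int alpha = softMask[i] & 0xFF; // read the byte as unsigned 0..255
            if (alpha >= threshold) {
                packed[i / 8] |= 0x80 >> (i % 8);
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        byte[] soft = {0, 64, 127, (byte) 128, (byte) 200, (byte) 255, 10, (byte) 250};
        byte[] stencil = toStencil(soft, 128);
        // Semi-transparent samples such as 64 or 200 collapse to fully off or fully on.
        System.out.println(String.format("%d bytes -> %d byte(s), first byte 0x%02X",
                soft.length, stencil.length, stencil[0])); // prints "8 bytes -> 1 byte(s), first byte 0x1D"
    }
}
```

The mask shrinks by a factor of eight, but every intermediate transparency level is lost, which is consistent with the note that smooth or anti-aliased edges may look worse after conversion.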

Thus, taking into consideration what is said above:

See the details above.

As described above, OptimizeFileSize currently affects only fonts. In a future implementation, this option will optimize image size too.

As described above, OptimizationOptions manages the document optimization methods, including repacking images. An image may be stored in FlateDecode format if the optimizer finds that this decreases the size of the image. Other options of OptimizeResources affect document structure, objects, streams, fonts, etc.

If we are talking about ImageCompressionOptions and OptimizationOptions, then yes, ImageCompressionOptions is part of OptimizationOptions. The PdfAConvertStrategy.OptimizeFileSize property is not a subset of OptimizationOptions; it is an independent option, although it partially does the same thing (font subsetting).

OptimizationOptions and OptimizeResources() are independent of PDF/A conversion and may be used for any PDF document, not only during PDF/A conversion.

rnara · September 24, 2020, 1:56pm

Hi @asad.ali,

I was testing some PDF files using the old code and the new code shared by Aspose for PDF optimization (upon PDF to PDF/A transformation), and found certain issues with the new code. Attaching the files here:

Test1.zip (2.2 MB)
Test5.zip (1007.7 KB)

The code used here is the one mentioned above.

Thanks,
Avinash.

asad.ali · September 24, 2020, 8:22pm

@sumeetm

We have logged the following tickets for the issues observed in Aspose.PDF for Java 20.9:

PDFJAVA-39801

PDFJAVA-39802

We will check them further in detail and keep you posted on the status of their rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

rnara · September 25, 2020, 1:54am

Hi @asad.ali,
It seems this issue was opened in public. Can you please make it private?

asad.ali · September 25, 2020, 8:09pm

@sumeetm

Would you please explain what you mean by making the issue private? Do you want us to make the attachments private, or the complete thread?

rnara · September 28, 2020, 4:30am

Complete thread.

asad.ali · September 28, 2020, 7:05pm

@sumeetm

We have marked this whole thread as private now.

rnara · October 7, 2020, 4:10am

Hi @asad.ali,
Can we have an update on this issue?

Thanks,
Avinash.

asad.ali · October 7, 2020, 7:56pm

@sumeetm

The tickets were logged recently in our issue management system, and we are afraid that they are still pending analysis. We will surely investigate and resolve them on a first-come, first-served basis. We will inform you as soon as we have additional updates in this regard. Please give us some time.

We apologize for the inconvenience.

rnara · October 16, 2020, 4:22am

Hi @asad.ali,

Can we have an update on this issue?

Thanks,
Avinash.

asad.ali · October 16, 2020, 6:18pm

@sumeetm

We are afraid that no updates are available regarding ticket resolution. We will let you know within this forum thread as soon as we have some news about their fix.

We apologize for the inconvenience.

rnara · November 23, 2020, 5:55am

Hi @asad.ali,
Any update on this issue?

Thanks,
Avinash.

asad.ali · November 23, 2020, 12:49pm

@sumeetm

Regretfully, the tickets have not been completely investigated yet. As soon as the analysis is completed, we will share updates with you within this forum thread. Please spare us some time.

We apologize for the inconvenience.

rnara · December 22, 2020, 4:57am

Hi @asad.ali,
Please share the status of the issue.

asad.ali · December 22, 2020, 8:06pm

@sumeetm

We regret to inform you that no update is available at the moment regarding ticket resolution. As shared earlier, we will surely inform you once the tickets are completely investigated and rectified. We highly appreciate your patience and cooperation in this matter. Please give us some time.

We really apologize for the inconvenience and delay.