Free Support Forum - aspose.com

Aspose Pdf uses two ways to optimize PDF file size

Hi Team,
I was using the following code earlier to produce an optimized PDF/A file:

    Document doc = new Document("Office.pdf");
    PdfFormatConversionOptions opts = new PdfFormatConversionOptions("logFile.Txt", PdfFormat.PDF_A_1B, ConvertErrorAction.Delete);
    opts.setOptimizeFileSize(true);
    doc.convert(opts);
    doc.save("AsposePDF.pdf");

We had an earlier query (https://forum.aspose.com/t/convert-pdf-to-pdf-a-pdfa-output-size-varies-hugely-for-similar-input-files-ecdcts-5563/179342) which was resolved in Aspose.PDF 20.7.
In that resolution, the following code is used to produce an optimized PDF/A file:

    Document doc = new Document("Office.pdf");
    PdfFormatConversionOptions opts = new PdfFormatConversionOptions("logFile.Txt", PdfFormat.PDF_A_1B, ConvertErrorAction.Delete);
    opts.setConvertSoftMaskAction(ConvertSoftMaskAction.ConvertToStencilMask);
    doc.convert(opts);
    OptimizationOptions optimizationOptions = new OptimizationOptions();
    optimizationOptions.setSubsetFonts(true);
    optimizationOptions.setRemoveUnusedStreams(true);
    optimizationOptions.setRemoveUnusedObjects(true);
    doc.optimizeResources(optimizationOptions);
    doc.save("AsposePDF.pdf");

Now the query is:

Which one should be used on a regular basis now?
What is the difference between the two functionalities?
Is there any side effect of using the second option, given that it produces a much smaller output file in comparison?

Thanks,
Avinash.

@sumeetm

The suggested code snippet saves images in FlateDecode format and that is why it was recommended to use. We did not find any side effects for using the code while testing it. In case you are facing some error or noticing any anomaly, please feel free to share it with us. We will surely investigate the case in detail and share our feedback with you accordingly.

Hi @asad.ali,

  1. Can you please explain the technical difference between the two options in depth?
  2. setOptimizeFileSize is used for PDF structure optimization and deals specifically with the PDF to PDF/A use case, as explained in our earlier tickets. Does it not handle images in any way?
    Need details about PdfFormatConversionOptions.setOptimizeFileSize API
  3. optimizationOptions saves images in FlateDecode format. Does it also optimize the document structure, or does it deal only with images, or both?
  4. Is one a subset of the other?
  5. Can optimizationOptions be used for other cases as well, or only for PDF to PDF/A?

Please reply to all 5 queries.

Regards,
Ankur Vashishtha

@sumeetm

The referenced thread is quite old and many changes have been made to the API since then. We will share detailed feedback on all your inquiries in accordance with the latest version of the API. We are gathering details on our end and will share them with you soon.

As an initial response, we would like to add that the setOptimizeFileSize method is offered only when a PDF is being converted into PDF/A. If you are converting a document into PDF/A format and want to reduce file size, it is better to combine both functionalities: the OptimizeFileSize flag and optimizationOptions.

We will get back to you with further details shortly.

Hi @asad.ali,

Can we have the update please?

Thanks,
Avinash.

@sumeetm

We are gathering the details on our side and intend to provide you with detailed feedback as soon as possible. We will try to share the required information during the current week. Please give us a little time.

Hi @asad.ali,
Please update us with the required information. This is in relation to an escalated issue which was reported in the enterprise support. Please provide us the info asap, else all the hard work done by the enterprise support will be undone.

Regards,
Ankur Vashishtha

@sumeetm

We are currently in the process of publishing a new release of the API, and we intend to publish the required information once that ongoing task is done. However, could you please share the ticket ID associated with enterprise support that relates to this inquiry? You may also post your inquiry under the other ticket you opened in the helpdesk (paid support forum). This way, the priority of the investigation will be escalated. We greatly appreciate your patience in this matter.

Hi @asad.ali,
How is it possible to release a jar without its documentation?
We have been checking with Boris consistently on this issue, as it has been very highly escalated by the customer. Kindly understand its importance.
Adding the mail reply sent by Boris here:

From: Boris Pazin <Boris.Pazin@aspose.com>
Sent: 22 July 2020 21:45
To: Karthik Gokare Somashekhar <ksomashekhar@opentext.com>; Mahesh Rao <mrao@opentext.com>; Jenna Jessop <jjessop@opentext.com>
Cc: Ankur Vashishtha <avashish@opentext.com>
Subject: [EXTERNAL] - RE: OpenText: PDFJAVA-37850, PDFJAVA-37882

Hi Mahesh, Kartik,

I am glad to inform you that this issue is fixed in new Aspose.PDF for Java 20.7.

Please use the code snippet like this:

    Document doc = new Document("Office.pdf");
    PdfFormatConversionOptions opts = new PdfFormatConversionOptions("logFile.Txt", PdfFormat.PDF_A_1B, ConvertErrorAction.Delete);
    opts.setConvertSoftMaskAction(ConvertSoftMaskAction.ConvertToStencilMask);
    doc.convert(opts);
    OptimizationOptions optimizationOptions = new OptimizationOptions();
    optimizationOptions.setSubsetFonts(true);
    optimizationOptions.setRemoveUnusedStreams(true);
    optimizationOptions.setRemoveUnusedObjects(true);
    doc.optimizeResources(optimizationOptions);
    doc.save("AsposePDF.pdf");

Thanks,
Boris Pazin

We can’t use the code without the proper information.

Thanks,
Avinash.

@sumeetm

We publish all the API information that our customers are commonly interested in, both in the public documentation and in the API References. However, you have asked for TECHNICAL DEPTHS of different methods, and we need to gather all the details of the logic implemented in the API in order to answer you properly.

We have noted your concerns and updated the referenced ticket accordingly as well. We will soon get back to you with required information. Please give us some time.

Hi @asad.ali,
Can you please provide us with an update on this? We are in a difficult situation here.

Thanks,
Avinash.

@sumeetm

Please check the following details about the queries you have asked:

OptimizeResources() Method

The Document.OptimizeResources() method decreases document size. It can apply several techniques, and OptimizationOptions controls which of them are used. Each option is explained below.

  • RemoveUnusedObjects: A PDF document consists of PDF objects. Every object has a number (ID) and belongs to one of the following types: name, string, number, null (scalar values), dictionary, array (these form the PDF document structure), or stream (raw binary data). Objects may be referenced from other objects; for example, a dictionary or array may contain references to other objects. These references tie all parts of the PDF document together and form its structure. Stream objects contain binary data, which may be large; images and fonts, for example, are stored as stream objects. After some manipulations of the document, some streams may become “orphaned”, i.e. nothing references them any longer. For example, an old image was replaced with a new one, but the old instance of the image was not removed. In other words, the stream no longer belongs to the document logically but is still contained in it physically. RemoveUnusedObjects finds orphaned objects in the document and removes them, which decreases document size if such objects were found.

  • RemoveUnusedStreams: Every document page has a Resources dictionary which contains data such as images and fonts that are used in the page contents. Resources are referenced by their names in the dictionary; for example, the page may contain an operator that draws the image named “Image12” at a particular place on the page. In some cases a resource may become unused: for example, an image was removed from the page contents but left in the page resources, or a page was extracted from a document but its resources still contain the common resources of that document. The resource has become “orphaned”, but please note that this is a different situation from the one described under RemoveUnusedObjects: the object is still referenced from the resources dictionary of the page, yet the page never uses it (its name never appears in the page contents). RemoveUnusedStreams finds and removes these unnecessary resources. Because the removed resource streams are then no longer linked to the document structure, the RemoveUnusedObjects option is automatically activated when RemoveUnusedStreams is used.

  • LinkDuplicateStreams: A document may contain several copies of a stream with the same contents. For example, this may occur when two or more identical documents are merged: every copy of the same page has its own resources dictionary with its own copies of the images, fonts, and other resources. LinkDuplicateStreams finds stream objects with equal contents and merges them into one object, replacing references to the objects accordingly. This decreases document size because the duplicated information is removed.

  • SubsetFonts: Every font used to display text on a page contains a set of glyphs for the font's characters. The PDF specification supports “font subsets”, i.e. fonts that contain only the glyphs which are actually used. This may cause issues when the text is updated later (since the required glyphs may be absent from the font), but for a document that is not going to change it decreases the size.

  • UnembedFonts: Fonts may be embedded into a PDF document, i.e. all font data is contained in the font resource, or not embedded, in which case the required font is loaded from the fonts installed on the computer. Unembedding fonts may help to decrease size but may cause issues when the document is displayed on a computer where the required fonts are not installed.

  • AllowReusePageContents: The page content is a set of operators describing the page appearance, and it is stored in a stream object. If the document contains identical pages, their contents can be merged, i.e. different pages share one stream object which holds their contents. This may decrease document size if the page content is large. The disadvantage is that when one of the pages is changed, all its copies are changed as well (since they use the same object).

  • RemovePrivateInfo: A page may contain private information intended for a conforming reader. This entry may hold information of any type, and sometimes that information is large. RemovePrivateInfo removes it.

  • ImageCompressionOptions: File size can also be reduced by optimizing images, but recompressing or resizing images may cause loss of image quality. OptimizationOptions contains a set of options for image compression (ImageCompressionOptions):

    • CompressImages: This flag determines whether image compression is allowed. If it is false (the default), no changes are made to the images.

    • ResizeImages: This flag determines whether changing image dimensions is allowed. False by default.

    • ImageQuality: The required quality of the image (in percent). Applicable when CompressImages is true. Images are recompressed using the JPEG algorithm at the given quality.

    • MaxResolution: The maximum desired resolution of the images (in DPI). Images whose resolution is higher than the specified maximum are reduced in dimension to match it. The image resolution is calculated from the size of the image on the page in user units and the physical image dimensions in pixels.

    • ImageEncoding: The image encoding that will be used when trying to recompress images. In some cases (for example, for some monochrome images) Flate encoding may give a smaller size than JPEG compression. Please note that specifying ImageEncoding does not mean that all images will be recompressed with this algorithm. The image optimizer tries the specified format, and if compressing in this format does not decrease the image size, recompression is not done.
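
The difference between RemoveUnusedStreams and RemoveUnusedObjects explained above can be sketched with a toy model. This is not Aspose code and not how the library is implemented: ToyPdf and all of its members are hypothetical names, the "document" has a single page, and every object is reduced to the list of objects it references.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy single-page PDF object graph (hypothetical model, not Aspose code). */
final class ToyPdf {
    static final int ROOT = 1; // stands in for the document catalog
    static final int PAGE = 2; // the only page

    /** Object number -> object numbers it references directly. */
    final Map<Integer, Set<Integer>> objects = new HashMap<>();
    /** Page resource name -> object number of the resource stream. */
    final Map<String, Integer> pageResources = new HashMap<>();
    /** Resource names that the page content operators actually use. */
    final Set<String> namesUsedInContent = new HashSet<>();

    ToyPdf() {
        objects.put(ROOT, new HashSet<>(Set.of(PAGE)));
        objects.put(PAGE, new HashSet<>());
    }

    /** Adds a stream object that nothing references (an orphan until linked). */
    void addStream(int id) {
        objects.put(id, new HashSet<>());
    }

    /** Adds a stream and registers it in the page's resources under a name. */
    void addResource(String name, int id) {
        addStream(id);
        pageResources.put(name, id);
        objects.get(PAGE).add(id);
    }

    /** Analogue of RemoveUnusedStreams: drop resources whose name the content never uses. */
    void removeUnusedResources() {
        pageResources.keySet().retainAll(namesUsedInContent);
        // The page now references only the resources that survived.
        objects.put(PAGE, new HashSet<>(pageResources.values()));
    }

    /** Analogue of RemoveUnusedObjects: delete every object unreachable from the root. */
    void removeUnusedObjects() {
        Set<Integer> reachable = new HashSet<>();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(ROOT);
        while (!pending.isEmpty()) {
            int id = pending.pop();
            if (reachable.add(id)) {
                pending.addAll(objects.getOrDefault(id, Set.of()));
            }
        }
        objects.keySet().retainAll(reachable);
    }
}
```

In this model, a stream added with addStream(5) disappears after removeUnusedObjects() alone, because nothing references it. A resource added with addResource("Image13", 4) whose name is missing from namesUsedInContent survives that call, since the page still references it; it only disappears once removeUnusedResources() has unlinked it and removeUnusedObjects() runs again, which mirrors why the second option is activated together with the first.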

Optimize() method

This method performs what is also known as linearization. Calling Document.Optimize() and setting Document.IsLinearized to true are equivalent.

Linearization is the process of optimizing a PDF document for use on the Web. Its purpose is to display the first pages of the document as soon as possible when the document is loaded over a slow connection. To achieve this, document objects are reordered so that the significant document structures and the first page's structures are placed at the beginning of the file. Please note that linearization does not decrease the size of the document. The process is called optimization in the sense of optimizing the document for fast loading over the Web.
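
To make the point about reordering concrete, here is a toy sketch of the idea. It is only an illustration under simplifying assumptions: ToyLinearizer, Obj, and the neededForFirstPage flag are hypothetical names, and the sketch ignores everything else a real linearized file needs (such as hint tables). It only shows that moving first-page objects to the front changes how soon the first page can be shown, not how many bytes the objects occupy.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of linearization-style reordering (hypothetical, not Aspose code). */
final class ToyLinearizer {
    /** Stand-in for a PDF object: number, size in bytes, and whether page 1 needs it. */
    static final class Obj {
        final int id;
        final int sizeBytes;
        final boolean neededForFirstPage;

        Obj(int id, int sizeBytes, boolean neededForFirstPage) {
            this.id = id;
            this.sizeBytes = sizeBytes;
            this.neededForFirstPage = neededForFirstPage;
        }
    }

    /** Stable partition: first-page objects move to the front, relative order is kept. */
    static List<Obj> reorder(List<Obj> objects) {
        List<Obj> front = new ArrayList<>();
        List<Obj> rest = new ArrayList<>();
        for (Obj o : objects) {
            (o.neededForFirstPage ? front : rest).add(o);
        }
        front.addAll(rest);
        return front;
    }

    /** Bytes that must arrive before every first-page object has been received. */
    static int bytesUntilFirstPageReady(List<Obj> objects) {
        int offset = 0;
        int ready = 0;
        for (Obj o : objects) {
            offset += o.sizeBytes;
            if (o.neededForFirstPage) {
                ready = offset;
            }
        }
        return ready;
    }

    /** Sum of all object sizes; reordering leaves this unchanged. */
    static int totalBytes(List<Obj> objects) {
        int total = 0;
        for (Obj o : objects) {
            total += o.sizeBytes;
        }
        return total;
    }
}
```

For objects of 500, 100, 800, and 50 bytes where only the second and fourth are needed for the first page, bytesUntilFirstPageReady drops from 1450 to 150 after reorder, while totalBytes stays at 1450 in both orders.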

PDF/A conversion
PdfFormatConversionOptions

  • OptimizeFileSize
    If this option is set to true, additional actions to decrease document size are performed during PDF/A conversion, which may take additional time. For now, the only action of OptimizeFileSize is applying font subsets to the fonts used in the document, i.e. it is the same as OptimizationOptions.SubsetFonts. Later, we are planning to introduce other methods of decreasing file size during PDF/A conversion.

  • ConvertSoftMaskAction
    Defines how images with a soft mask are handled. If this property is set to ConvertToStencilMask, an image with a soft mask is converted into an image with a stencil mask. Otherwise, the part of the page covered by the image is converted into JPG, and the resulting image is drawn on the page instead of the original image. Converting to a stencil mask decreases the size of the image but may cause loss of image quality in some cases.
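
Why a stencil mask is smaller but can lose quality can be shown with a small sketch. The assumptions here are ours, not Aspose's: the soft mask is modelled as one 8-bit opacity value per pixel, the stencil mask as one bit per pixel, and the conversion as a simple threshold. ToyMask, toStencil, and the threshold parameter are hypothetical names; the library's actual conversion rule is not documented in this thread.

```java
/** Toy soft-mask to stencil-mask conversion (hypothetical, not Aspose code). */
final class ToyMask {
    /**
     * Thresholds 8-bit opacity samples (0 = transparent, 255 = opaque)
     * into a packed 1-bit mask, most significant bit first.
     */
    static byte[] toStencil(int[] softMask, int threshold) {
        byte[] packed = new byte[(softMask.length + 7) / 8];
        for (int i = 0; i < softMask.length; i++) {
            if (softMask[i] >= threshold) {
                packed[i / 8] |= (byte) (0x80 >> (i % 8));
            }
        }
        return packed;
    }

    /** Reads back pixel i of the packed mask: true means fully opaque in this model. */
    static boolean isOpaque(byte[] packed, int i) {
        return (packed[i / 8] & (0x80 >> (i % 8))) != 0;
    }
}
```

Nine soft-mask samples stored at one byte each shrink to two bytes here, but a half-transparent sample such as 129 and a fully opaque 255 become indistinguishable, which is the kind of quality loss the description above warns about.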

Taking the above into consideration, here are the answers to your five queries:

  1. See the details above.

  2. As described above, OptimizeFileSize currently affects only fonts. In a future implementation, this option will optimize image size too.

  3. As described above, OptimizationOptions manages the document optimization methods, including repacking images. An image may be stored in FlateDecode format if the optimizer finds that this decreases the size of the image. The other options of OptimizeResources affect document structure, objects, streams, fonts, etc.

  4. If we are talking about ImageCompressionOptions and OptimizationOptions, then yes, ImageCompressionOptions is part of OptimizationOptions. The PdfAConvertStrategy.OptimizeFileSize property is not a subset of OptimizationOptions; it is an independent option, although it partially does the same thing (font subsetting).

  5. OptimizationOptions and OptimizeResources() are independent of PDF/A conversion and may be used for any PDF document, not only during PDF/A conversion.
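
Since OptimizeFileSize and OptimizationOptions.SubsetFonts both come down to font subsetting, a minimal sketch of that idea may help. It is a toy model under stated assumptions, not Aspose code: a "font" is just a map from character to glyph bytes, and ToyFontSubsetter, subset, and canRender are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy font subsetting: keep only the glyphs a text uses (hypothetical, not Aspose code). */
final class ToyFontSubsetter {
    /** Returns a new "font" containing only the glyphs for characters found in the text. */
    static Map<Character, byte[]> subset(Map<Character, byte[]> font, String text) {
        Map<Character, byte[]> kept = new TreeMap<>();
        for (char c : text.toCharArray()) {
            byte[] glyph = font.get(c);
            if (glyph != null) {
                kept.put(c, glyph);
            }
        }
        return kept;
    }

    /** True if the font has a glyph for every character of the text. */
    static boolean canRender(Map<Character, byte[]> font, String text) {
        for (char c : text.toCharArray()) {
            if (!font.containsKey(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Subsetting a 26-glyph font for the text "hello" keeps 4 glyphs, so the font data shrinks, and the original text still renders. Changing the text to "help" afterwards fails canRender because the glyph for "p" was dropped, which is the editing limitation mentioned for SubsetFonts.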