PDF Size Issue

rhustead · March 20, 2017, 6:30am

Hi,

We are generating a PDF document through the following process:
1.) A static Word document is converted to PDF.
2.) We then loop over all the pages of the PDF generated in step 1 and insert a header containing an image on every page.

The Word template has 8 pages and a size of 34 KB (please find attached).

The image inserted in the header has a size of 29.8 KB (please find attached).

The PDF generated after the first step is also attached for reference and has a size of 164 KB.

The PDF generated after adding the header images is also attached and has a size of 808 KB.

The issues are as follows:
1.) A 34 KB Word document grows to 164 KB when converted to PDF.
2.) Since the header image is repeated on 8 pages, the final PDF should grow by about 29.8 × 8 = 238.4 KB, but the final PDF has a size of 808 KB.

Attached is the code snippet used to generate the document. We have already used the optimize() method of the Aspose.PDF API.

Since the Word template can have many more pages than the sample shared here, the PDF size will grow considerably with the page count.
Please provide a way to reduce the size of the PDF.

The version of Aspose Words being used is 16.4.0.0
The version of Aspose PDF being used is 17.1.0.0

fahadadeel · March 21, 2017, 6:09am

Hi Robert,

Thanks for contacting support.

I am looking into the details and will share my findings with you shortly.

Best Regards,

fahadadeel · March 21, 2017, 7:26am

Hi Robert,

Thanks for using our APIs.

I have tested the scenario and managed to reproduce the problem: the resultant PDF file is larger than expected. So that it can be corrected, I have logged it as PDFJAVA-36613 in our issue tracking system. We will look further into the details of this problem and keep you posted on the status of the correction. Please be patient and allow us a little time. We are sorry for the inconvenience.

Best Regards,

fahadadeel · March 24, 2017, 1:57am

Hi Robert,

After further investigation, we have found that the following code snippet optimizes the file successfully; it results in a 236 KB file.

JAVA

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.nio.file.Files;

try {
    // Step 1: convert the Word document to PDF with Aspose.Words
    com.aspose.words.Document wdoc = new com.aspose.words.Document(dataDir + "WordDocumentWithText.docx");
    wdoc.save(dataDir + "pdffile.pdf", com.aspose.words.SaveFormat.PDF);

    // Read the header image into memory
    String imageName = "headerimage.png";
    File file = new File(dataDir, imageName);
    byte[] headerimageBytes = Files.readAllBytes(file.toPath());

    // Step 2: open the generated PDF with Aspose.PDF
    com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(dataDir + "pdffile.pdf");

    // Create the header image once and reuse the same instance on every page
    com.aspose.pdf.Image headerimage = new com.aspose.pdf.Image();
    headerimage.setImageStream(new ByteArrayInputStream(headerimageBytes));
    headerimage.setFixHeight(90);
    com.aspose.pdf.PageCollection pgCol = pdfDocument.getPages();

    // Page indexes in Aspose.PDF start at 1
    for (int i = 1; i <= pgCol.size(); i++) {
        com.aspose.pdf.Page pg = pgCol.get_Item(i);
        com.aspose.pdf.PageInfo pgInfo = pg.getPageInfo();
        pgInfo.setWidth(792);
        pgInfo.setHeight(612);
        pgInfo.getMargin().setRight(0);
        pgInfo.getMargin().setLeft(0);

        com.aspose.pdf.HeaderFooter claimsHeader = new com.aspose.pdf.HeaderFooter();
        claimsHeader.getMargin().setTop(0);
        claimsHeader.getMargin().setRight(0);
        claimsHeader.getMargin().setLeft(0);
        claimsHeader.getParagraphs().add(headerimage);
        pg.setHeader(claimsHeader);
    }

    // Save to memory and reload the document before optimizing it
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    pdfDocument.save(stream);
    pdfDocument = new com.aspose.pdf.Document(new ByteArrayInputStream(stream.toByteArray()));

    com.aspose.pdf.Document.OptimizationOptions opt = new com.aspose.pdf.Document.OptimizationOptions();
    // Object and stream clean-up options are switched off in this snippet
    opt.setRemoveUnusedObjects(false);
    opt.setLinkDuplcateStreams(false);
    opt.setRemoveUnusedStreams(false);
    // Enable image compression
    opt.setCompressImages(true);
    // Set the quality of images in the PDF file
    opt.setImageQuality(10);

    pdfDocument.optimizeResources(opt);

    pdfDocument.save(dataDir + "final1.pdf");
} catch (Exception e) {
    e.printStackTrace();
}

Also, please note that a PDF file usually stores an image as a separate object (an XObject) that contains the binary data for the image. It is wrong to think of images embedded in a PDF as TIFF, GIF, BMP, JPEG or PNG files; they are not. What the PDF holds is not an image file in that sense but the binary data for the pixels, the colorspace used for the image, and other information about the image. So the initial size of the final PDF is much bigger than 238.4 KB. The pixel data can be compressed, and one of the compression filters (DCTDecode) is the same as that used in JPEG, while JPXDecode corresponds to JPEG 2000.
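To make the point about pixel data concrete, here is a minimal sketch in plain Java (no Aspose calls) that compares the size of an image encoded as PNG with the size of the same pixels stored as uncompressed 8-bit RGB. The class name RawImageSize, its helper methods and the 792 x 90 banner dimensions are assumptions made for illustration only; the real dimensions of the header image are not stated in this thread.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class RawImageSize {

    // Bytes needed for uncompressed pixel data:
    // width * height * components per pixel * bits per component / 8.
    static long rawBytes(int width, int height, int components, int bitsPerComponent) {
        return (long) width * height * components * bitsPerComponent / 8;
    }

    // Encodes the image as PNG in memory and returns the encoded length in bytes.
    static int pngBytes(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "png", out);
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical banner dimensions, chosen only for this illustration.
        int width = 792;
        int height = 90;

        // A flat-colour banner stands in for the real header image.
        BufferedImage banner = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = banner.createGraphics();
        g.setColor(Color.BLUE);
        g.fillRect(0, 0, width, height);
        g.dispose();

        System.out.println("PNG-encoded bytes:   " + pngBytes(banner));
        System.out.println("Raw 8-bit RGB bytes: " + rawBytes(width, height, 3, 8));
    }
}
```

The gap between the two numbers depends heavily on the image content and on which filter the PDF writer applies to the stream, so this should be read only as a rough indication of why the on-disk size of a PNG is not a reliable predictor of the growth of the PDF.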

If you still need further assistance, please feel free to contact us.

Best Regards,