Optimize fonts when concatenating PDFs

Dear ladies and gentlemen,



In our application we have to combine several (e.g. 100) PDF files.

To optimize the resulting file we use this code:



pdfFileEditor.Concatenate(pdfStreams.ToArray(), packPdf);

packPdf.Seek(0, SeekOrigin.Begin);
var pdfDocument = new Document(packPdf);
foreach (Page page in pdfDocument.Pages)
{
    var idx = 1;
    foreach (XImage image in page.Resources.Images)
    {
        using (var imageStream = new MemoryStream())
        {
            image.Save(imageStream, ImageFormat.Jpeg);
            imageStream.Seek(0, SeekOrigin.Begin);
            page.Resources.Images.Replace(idx, imageStream);
        }
        idx = idx + 1;
    }
}

// optimize the file size
pdfDocument.Optimize();
pdfDocument.OptimizeSize = true;
pdfDocument.OptimizeResources(new Document.OptimizationOptions
{
    RemoveUnusedStreams = true,
    RemoveUnusedObjects = true,
    LinkDuplcateStreams = true
});
// save updated File
pdfDocument.Save(newPdfFileName);





After this optimization the created PDF file is still too large. The cause is the fonts used in the source files.

The font definitions (the /Font dictionary and its dependent objects) are copied into the target file for each source document.

If the /FontFile2 streams are identical, only one stream is saved. If the streams are not identical, no union of all required characters is formed.

With an average stream size of 18.5 KB, the difference in file size is about 3 MB.
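
To illustrate the behaviour described above, here is a minimal sketch in plain C#. It is not Aspose.Pdf code and makes no claim about how LinkDuplcateStreams works internally; the class FontStreamDedup, its methods and the tiny byte arrays are hypothetical stand-ins. It assumes that streams are linked only when their bytes are exactly equal, and it turns the figures above (about 3 MB at 18.5 KB per stream) into a rough count of extra streams.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class FontStreamDedup
{
    // Number of streams left after linking byte-identical duplicates.
    public static int CountDistinct(IEnumerable<byte[]> streams)
    {
        var seen = new HashSet<string>();
        using (var sha = SHA256.Create())
        {
            foreach (var data in streams)
                seen.Add(Convert.ToBase64String(sha.ComputeHash(data)));
        }
        return seen.Count;
    }

    // Rough number of extra streams implied by a size difference.
    public static int EstimateExtraStreams(double extraKb, double averageStreamKb)
    {
        return (int)Math.Round(extraKb / averageStreamKb);
    }

    static void Main()
    {
        // Hypothetical data: three copies of one font stream ...
        var full = new byte[] { 1, 2, 3, 4 };
        var copies = new List<byte[]> { full, (byte[])full.Clone(), (byte[])full.Clone() };
        // ... versus three different subsets of the same font.
        var subsets = new List<byte[]> { new byte[] { 1, 2 }, new byte[] { 1, 3 }, new byte[] { 1, 4 } };

        Console.WriteLine(CountDistinct(copies));   // 1
        Console.WriteLine(CountDistinct(subsets));  // 3
        Console.WriteLine(EstimateExtraStreams(3 * 1024, 18.5)); // 166
    }
}
```

Under these assumptions, identical copies collapse to one stream, while subsets that differ by even one byte all stay in the file. A difference of about 3 MB then corresponds to roughly 166 additional font streams.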



Is there a way to merge the fonts efficiently?

Is such an implementation planned?



Kind regards,

Oliver

Hi Oliver,


Thanks for contacting support and sorry for the delayed response.

I have tested the scenario using Aspose.Pdf for .NET 10.2.0 in a Visual Studio 2010 application targeting .NET Framework 4.0 on Windows 7 (x64). When I concatenated 3 copies of the previously shared Bescheid_1_to_59_neu.pdf file (3.47 MB), the resulting PDF file was 3.46 MB. The size of the concatenated file is therefore about equal to the size of a single source file.

The size of the resulting file can be reduced further by un-embedding the custom fonts used in the document.

[C#]

// array of streams
FileStream[] pdfStreams = new FileStream[3];
pdfStreams[0] = new FileStream("c:/pdftest/Bescheid_1_to_59_neu.pdf", FileMode.Open);
pdfStreams[1] = new FileStream("c:/pdftest/Bescheid_1_to_59_neu - Copy.pdf", FileMode.Open);
pdfStreams[2] = new FileStream("c:/pdftest/Bescheid_1_to_59_neu - Copy (2).pdf", FileMode.Open);

MemoryStream packPdf = new MemoryStream();
Aspose.Pdf.Facades.PdfFileEditor pdfFileEditor = new PdfFileEditor();
pdfFileEditor.Concatenate(pdfStreams.ToArray(), packPdf);

packPdf.Seek(0, SeekOrigin.Begin);
var pdfDocument = new Document(packPdf);
foreach (Page page in pdfDocument.Pages)
{
    var idx = 1;
    foreach (XImage image in page.Resources.Images)
    {
        using (var imageStream = new MemoryStream())
        {
            image.Save(imageStream, System.Drawing.Imaging.ImageFormat.Jpeg);
            imageStream.Seek(0, SeekOrigin.Begin);
            page.Resources.Images.Replace(idx, imageStream);
        }
        idx = idx + 1;
    }
}

// optimize the file size
pdfDocument.Optimize();
pdfDocument.OptimizeSize = true;
pdfDocument.OptimizeResources(new Document.OptimizationOptions
{
    RemoveUnusedStreams = true,
    RemoveUnusedObjects = true,
    LinkDuplcateStreams = true,
    AllowReusePageContent = true
});
// save updated File
pdfDocument.Save("c:/pdftest/OptimizedFile.pdf");

Hi,

if you concatenate the 3 copies, the result is as expected. The font streams of each copy are identical, so the number of font streams in the target file equals the number of font streams in one of the copied files.

Our problem is that the font streams are not combined even when they belong to the same font. This increases the number of font streams and, as a result, the file size.

Perhaps the attached files help to show what I mean. In the old way, the single PDFs were extracted from one big file (oldway.pdf). The new way generates single PDFs (generated1.pdf to generated5.pdf) and concatenates them into one file (newway.pdf). We need both the single and the combined PDFs.

The size of the new single PDF files has increased because they are PDF/A files. The main problem seems to be the increased number of font streams. Even with Aspose.Pdf v10.2.0 and the AllowReusePageContent property, the file size is too large. I think un-embedding custom fonts is not an option because of the PDF/A standard.



Best regards

Oliver

Hi Oliver,


Thanks for sharing the details.

I have tested the scenario and I am able to reproduce the problem. So that it can be corrected, I have logged it in our issue tracking system as PDFNEWNET-38425. We will investigate this issue in detail and keep you updated on the status of a correction.

We apologize for the inconvenience.

Hi,

any news on this topic for me?

kind regards,
Oliver

Hi Oliver,


Thanks for your inquiry. I am afraid the issue you reported is still not resolved; it is pending investigation in the queue because other issues are already being investigated or resolved. We will notify you as soon as we make significant progress toward resolving it.

We are sorry for the inconvenience caused.

Best Regards,

Hi,

is there any news for me on this topic? We are waiting for a solution because we need to help our customers. Could you please check and give me a date for the solution?

Kind regards,
Oliver

Hi Oliver,


Thanks for your inquiry. I am afraid your issue is still not resolved. The product team is currently busy resolving other reported issues. However, we have asked the team to investigate the issue and share an ETA as early as possible. We will notify you as soon as we receive feedback.

Thanks for your patience and cooperation.

Best Regards,

Hi again,

could you please check again with the developers for an ETA?

thanks,

Kind regards,
Oliver

Hi Oliver,


Thanks for your patience. Our product team has investigated the issue and found that your files generated1, generated2 etc. don’t seem to be pages extracted from oldway.pdf. For example, an image in oldway.pdf is 1613 bytes, encoded with FlateDecode, but it is 17992 bytes in generated1.pdf.

Moreover, the fonts in generated1, generated2 etc. are not the same as in the original (oldway) file: the original file uses ArialItalic, whereas the generated files use ArialMT. Furthermore, please note that the fonts in the generated files are subsets and they are not identical. For example, the font file stream has a length of 24724 in generated1 and 24606 in generated2. This means that these fonts are not the same and cannot be optimized (reused), i.e. each of these fonts is included in the resulting file.

So you should improve the page extraction process. Please check the following code sample: it produces a set of documents extracted from the original file, then concatenates these files into one document and optimizes it. Hopefully it will help you to accomplish the task.

// extract page groups from the original file into separate documents
PdfFileEditor pfe = new PdfFileEditor();
int[][] pagesToExtract = new int[][] { new int[] { 1, 3 }, new int[] { 5 }, new int[] { 7, 9 }, new int[] { 11 }, new int[] { 13, 15 } };
for (int i = 0; i < 5; i++)
{
    pfe.Extract("oldway.pdf", pagesToExtract[i], "38425-generated" + (i + 1) + ".pdf");
}

// open the extracted documents
FileStream[] pdfStreams = new FileStream[5];
pdfStreams[0] = new FileStream("38425-generated1.pdf", FileMode.Open);
pdfStreams[1] = new FileStream("38425-generated2.pdf", FileMode.Open);
pdfStreams[2] = new FileStream("38425-generated3.pdf", FileMode.Open);
pdfStreams[3] = new FileStream("38425-generated4.pdf", FileMode.Open);
pdfStreams[4] = new FileStream("38425-generated5.pdf", FileMode.Open);

// concatenate them into one file
Aspose.Pdf.Facades.PdfFileEditor pdfFileEditor = new PdfFileEditor();
FileStream outStream = new FileStream("38425-concatenated.pdf", FileMode.Create, FileAccess.ReadWrite);
pdfFileEditor.Concatenate(pdfStreams, outStream);
outStream.Close();
// release the input files once concatenation is done
foreach (FileStream stream in pdfStreams)
{
    stream.Close();
}

// optimize the concatenated document and save it
Document doc = new Document("38425-concatenated.pdf");
doc.OptimizeResources(new Document.OptimizationOptions
{
    RemoveUnusedStreams = true,
    RemoveUnusedObjects = true,
    LinkDuplcateStreams = true,
    AllowReusePageContent = true
});
doc.Save("38425-optimized.pdf");


Best Regards,

Dear ladies and gentlemen,



thanks for your reply, but I think there was a misunderstanding. The files generated1.pdf to generated5.pdf were generated by our application in a new way and were combined into newway.pdf.

The file oldway.pdf is the result of the old generation; it shows how small the result was with the old generation.

As you can see, the difference in size is quite large.

What we need now is a way to combine the newly generated files (generated1.pdf to generated5.pdf) so that the size of newway.pdf is reduced (similar to oldway.pdf).

The attached source file is part of the code we use.

Thanks in advance.



Best regards

Oliver

Hi Oliver,


Thanks for your feedback. We have shared the details with our product team for further investigation. We will keep you updated about the issue resolution progress.

Thanks for your patience and cooperation.

Best Regards,

Hi,
do you have any information from your product-team for me concerning this issue?

thank you and kind regards,
Oliver

Hi Oliver,


The product team has investigated the issue further. As per our observations, oldgeneration.pdf contains different fonts and images than the “generated” files, so we are currently not certain that it can illustrate how small the result could be. It appears to be a different document.

Please note that the fonts cannot be optimized, because the fonts in generated1, generated2 etc. are subsets and therefore differ from each other. Currently every font from the generated files is included in the concatenated file. However, when you talk about “new” and “old” generation, do you mean changes in your software, so that you could use the full version of the same font (not a subset) in each of the generated files? This may give a chance to use this font only once in the concatenated file. Otherwise it appears to be impossible, because the fonts in the generated files are different.

In other words: “The font streams of each copy are identical” is not quite correct; the fonts in the generated documents are different because they are subsets.
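
The point about subsets can be shown with a small sketch in plain C#. It does not use Aspose.Pdf and does not parse real font programs; FontSubsetDemo and its methods are hypothetical, and a "subset" is modelled simply as the sorted set of characters a document uses. Under that assumption, two documents whose text differs by a single character get different subsets, while a full font (or the union of all subsets) would be the same for every document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FontSubsetDemo
{
    // Hypothetical stand-in for a subset font: the sorted set of characters used.
    public static string Subset(string text)
    {
        return new string(text.Distinct().OrderBy(c => c).ToArray());
    }

    // Union of several subsets: what a shared, merged font would have to contain.
    public static string Union(IEnumerable<string> subsets)
    {
        return new string(subsets.SelectMany(s => s).Distinct().OrderBy(c => c).ToArray());
    }

    static void Main()
    {
        string a = Subset("Bescheid 1");
        string b = Subset("Bescheid 2");

        Console.WriteLine(a == b);                       // False
        Console.WriteLine(Union(new[] { a, b }).Length); // 10
    }
}
```

In this model the two subsets have 9 characters each and differ only in the digit, yet they are not equal and so could not be linked as duplicates. The union has 10 characters, which is why embedding the same full font in every file, as suggested above, gives the optimizer identical streams to reuse.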

Hi,

<use full version of the same font (not subset) in every of generated files? This may give a chance to use this font only once in concatenated files>
We changed our program and were able to reduce the PDF size considerably. This is solved. Thanks for your help.

Kind regards,
Oliver

Hi Oliver,


Thanks for the acknowledgement.

We are glad to hear that your problem is resolved. Please continue using our APIs, and should you have any further queries, please feel free to contact us.

The issues you have found earlier (filed as PDFNEWNET-38425) have been fixed in Aspose.Pdf for .NET 10.9.0.

