Free Support Forum - aspose.com

Generated PDF 16 times too large

I am using the evaluation version of Words and Pdf for an application I am developing.

I have a source file written in Word, which is loaded, merged with a data source, and then saved as a PDF file.

Aspose.Words.Document doc = new Aspose.Words.Document(filePath);

/* replace content here */

doc.Save(savePath + ".xml", Aspose.Words.SaveFormat.AsposePdf);
Aspose.Pdf.Pdf pdf = new Aspose.Pdf.Pdf();
pdf.BindXML(savePath + ".xml", null);
pdf.IsImagesInXmlDeleteNeeded = true;
pdf.CompressionLevel = 9;
pdf.Save(savePath);

However, the completed file is 2.5 MB. By comparison, if I open the Word document, manually merge the contents, and use the Adobe PDF Maker from within Microsoft Word, the file is 149 KB.

I’ve uploaded a zip file with the two files in it for comparison:
http://65.110.95.184/ProposalTest.zip

Is there a solution for this? Because other than the file size, this application is perfect for my needs.

JK

Thank you for considering Aspose.

Can you please also provide the Word document and let us check it?

I have updated the zip file at the link above to include the source Word document as well.

JK

Dear JK,

I tested your document and found that Aspose.Words extracts 5 images from the Word document during the Word-to-PDF conversion. Those 5 files total 2.29 MB. So I think this issue should be reported to the Aspose.Words forum; I will move this thread there.

Hi,

Indeed, we currently output duplicated images to PDF. I’ve logged the issue as Auckland-72 and we hope to improve the functionality shortly. Thank you for reporting it to us.

Thanks for this.

Is there a workaround I can use on an immediate basis, without waiting for a patch or upgrade to the program?

JK

Yes, here is a workaround for this document only. It modifies the XML so that all instances of the duplicated image refer to the same image file:

[Test]
public void TestDuplicateImages()
{
    // Export the Word document to Aspose.Pdf XML in memory.
    Document doc = new Document("D:\\TestDuplicateImages.doc");
    MemoryStream stream = new MemoryStream();
    doc.Save(stream, SaveFormat.AsposePdf);
    stream.Seek(0, SeekOrigin.Begin);

    // Load the intermediate XML so the image references can be edited.
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(stream);
    XmlNamespaceManager nm = new XmlNamespaceManager(xmlDoc.NameTable);
    nm.AddNamespace("def", "Aspose.Pdf");
    XmlNodeList imageNodes = xmlDoc.DocumentElement.SelectNodes("//def:Image", nm);

    // Rewrite the File attribute of each image node so that duplicates share one file.
    // ReplaceImageNumber is a helper that is not included in this post.
    foreach (XmlNode imageNode in imageNodes)
    {
        string imageFileName = imageNode.Attributes["File"].Value;
        imageFileName = ReplaceImageNumber(imageFileName, "003.jpeg");
        imageFileName = ReplaceImageNumber(imageFileName, "005.jpeg");
        imageNode.Attributes["File"].Value = imageFileName;
    }

    // Build the PDF from the modified XML.
    Aspose.Pdf.Pdf pdf = new Aspose.Pdf.Pdf();
    pdf.IsImagesInXmlDeleteNeeded = true;
    pdf.BindXML(xmlDoc, null);
    pdf.IsTruetypeFontMapCached = true;
    pdf.TruetypeFontMapPath = Path.GetTempPath();
    pdf.Save("D:\\TestDuplicateImages Out.pdf");
}

The resulting PDF is still around 1 MB, but since the image itself is over 700 KB, I’m not sure what other optimizations can be applied here.

Hope this helps.

Thanks very much for this. 1 MB is definitely better, albeit not as small as what the direct-to-PDF compression settings produce.

A couple of questions on this.

Do you think there is a flaw in the original Word document that is resulting in the duplicated images? Any changes I could make to it that would prevent this from happening?

Alternatively, any ideas about how I could programmatically determine whether there are duplicate images and handle them with the same code above? Perhaps by file size and header bytes or something? Just so I wouldn’t have to customize the program for each file processed.

And lastly, if I were to use System.Drawing.Image to load the JPEG files and resize them with the GDI+ tools prior to importing them into a PDF, would this interfere with the Aspose PDF generation? It occurs to me that recompressing the JPEGs is probably how Acrobat Distiller is shrinking the file down to 150 KB, and I might follow suit.

Thanks,
JK

No, the original Word document is totally valid. It just contains the same image inserted in several places. Since at the moment Aspose.Words does not handle this specially when exporting to PDF, each instance is exported even though the image is the same. That is what will be improved.

I guess the only reliable programmatic way of detecting duplicated images is to calculate and compare hash values of the image data. That slows down the export, so I presume it must be optional and enabled only when the source document may contain duplicated images, as in your case. In the meantime, you can modify the code snippet posted above so that it automatically detects duplicated images by file size plus something like header bytes. That will work in 99% of cases, but note that it is not totally safe. The only method that guarantees the images are identical is verifying that their hash values are the same.
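As a rough illustration of the hash comparison, here is a minimal sketch that maps each image file to the first file seen with the same content. The class and method names (DuplicateImageFinder, MapToCanonical) are made up for this example, and it assumes the image files referenced by the exported XML can be read from disk at the time of hashing; that has not been checked against Aspose.Words.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of Aspose.Words or Aspose.Pdf.
public static class DuplicateImageFinder
{
    // Maps every path to the first path seen whose file content has the same SHA-256 hash.
    public static Dictionary<string, string> MapToCanonical(IEnumerable<string> imagePaths)
    {
        Dictionary<string, string> firstByHash = new Dictionary<string, string>();
        Dictionary<string, string> canonical = new Dictionary<string, string>();

        using (SHA256 sha = SHA256.Create())
        {
            foreach (string path in imagePaths)
            {
                string hash;
                using (FileStream fs = File.OpenRead(path))
                {
                    hash = BitConverter.ToString(sha.ComputeHash(fs));
                }

                string first;
                if (!firstByHash.TryGetValue(hash, out first))
                {
                    // First time this content is seen: the file is its own canonical copy.
                    first = path;
                    firstByHash.Add(hash, path);
                }
                canonical[path] = first;
            }
        }
        return canonical;
    }
}
```

In the workaround above, the File attribute of each image node could then be set to the canonical path for its current value instead of calling ReplaceImageNumber with hard-coded file names.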

Yes, recompressing JPEG images is a good idea. You can load the images into Bitmap objects and then resave them with the desired JPEG quality:

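// Requires: using System.Drawing; using System.Drawing.Imaging; using System.IO;
// Saves the image as a JPEG at the given quality (0 = smallest file, 100 = best quality).
public static void SaveJpeg(Image image, Stream stream, int quality)
{
    ImageCodecInfo encoderInfo = GetEncoderInfo(ImageFormat.Jpeg);
    EncoderParameters encoderParams = new EncoderParameters();
    // Encoder.Quality takes a long value; the cast makes that explicit.
    encoderParams.Param[0] = new EncoderParameter(Encoder.Quality, (long)quality);
    image.Save(stream, encoderInfo, encoderParams);
}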
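SaveJpeg calls a GetEncoderInfo helper that is not shown in this thread. One possible implementation, offered as an assumption about what that helper does, looks up the installed GDI+ encoder whose format ID matches the requested image format. The wrapper class name JpegCodecLookup is only there to make the sketch self-contained.

```csharp
using System.Drawing.Imaging;

// Hypothetical container class; the method could equally live next to SaveJpeg.
public static class JpegCodecLookup
{
    // Returns the installed encoder for the given format, or null if none is registered.
    public static ImageCodecInfo GetEncoderInfo(ImageFormat format)
    {
        foreach (ImageCodecInfo codec in ImageCodecInfo.GetImageEncoders())
        {
            if (codec.FormatID == format.Guid)
            {
                return codec;
            }
        }
        return null;
    }
}
```

To recompress an extracted image, you could load it with Image.FromFile, pass it to SaveJpeg together with an output stream and a quality value, and then point the PDF XML at the recompressed file. The right quality value depends on how much visible degradation is acceptable.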

Please let me know if you’ve succeeded in reducing the size of the output PDF.

This has helped remarkably, I think I’m almost finished.

One last question. When converting a Word document to a PDF, the Adobe conversion program ignores all Microsoft Word form fields and converts them to flat text in the PDF (as opposed to fillable PDF form fields).

Is there any easy way with Aspose to convert the document to a PDF, preserving the Word form fields as PDF Text Fields?

JK

Sure, just set

doc.SaveOptions.PdfExportFormFieldsAsText = false;

There may not be an easy solution to this, but I was wondering if there is any way to preserve the font information (family, size) when converting the form field to PDF?

JK

It seems like font size and color are preserved. I have logged your request as Auckland-79 in order to implement carrying over other font properties later. This may require the assistance of Aspose.Pdf team.

We have released a new version of Aspose.Words that contains a fix for the image duplication issue.
Identical images are now exported only once, which reduces file sizes significantly. This is implemented for DOC, PDF and HTML exports so far. Image duplication in WordML will be fixed in the next version.
The new version is available for download at:
https://releases.aspose.com/words/net/
Best regards,

The issues you have found earlier (filed as 2079) have been fixed in this update.

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for .NET 18.12 update and this Aspose.Words for Java 18.12 update.