Gigantic file size when saving Aspose.Words.Document as PDF

Continuing the discussion from Gigantic file size when saving Aspose.Cells.Workbook as PDF:

Dear support,
at the request of your colleague @amjad.sahi over in the Aspose.Cells department, I’m filing this performance issue with you. As stated in the linked topic, I don’t see this specific case as a practical example; your colleague nonetheless wanted it filed so that you can investigate the file size and runtime of the pipeline and perhaps optimize both, regardless of the specific data involved.

The processing pipeline (entirely implemented using Aspose.Cells & Aspose.Words) we’re using in this specific example would be

  1. Convert an Excel document into a Word document by recreating the data table of the Excel worksheet in a Word table. The document “Anonymized_Data2_*.xlsx” would be the input document to this step, which I’ve attached purely for reference.
  2. Post-process the resulting Word document (by adding, depending on the exact use case, header, footer, and data before and after the generated Word table) and save the resulting Word document as DOCX. “result_*.docx” is the output of this step (we typically wouldn’t save it as DOCX explicitly, but for your convenience and to compare file sizes, I’m handing you this one as well).
  3. Typically we’d populate the merge fields, but I’ve skipped that step in the reproduction.
  4. Save the Word document as a PDF.

For your convenience, I’ve attached the original Excel document (3292 rows) as well as 2 smaller versions with 10 and 500 rows respectively purely for reference (because you won’t be able to see the entire table in Word due to the sheer number of columns) as well as the respective word documents from step 2 of the pipeline. For the smaller versions I’ve simply truncated the number of data rows; but they might be easier to handle…

Also for your reference, I’ve noted the respective file sizes, which grow 17-fold for the 10-row example and over 60-fold for the complete data sheet (comparing the size of the intermediate Word document with that of the PDF generated using Aspose):

| Filename | File size [KB] | Comments |
|---|---|---|
| Anonymized_Data2_10.xlsx | 109 | Input to step 1 |
| result_10.docx | 30 | Output of step 2 |
| result_10_Aspose_Words_Document_Save_asPDF.pdf | 510 | Output of step 4. Time it takes Aspose.Words to run wordDocument.Save(…, SaveFormat.Pdf): 3 seconds |
| result_10_Word_PrintToPDF.pdf | 1,350 | Using MS Word to Print to PDF the file result_10.docx |
| result_10_Word_SaveAs_PDF.pdf | 910 | Using MS Word to Save as → PDF the file result_10.docx |
| Anonymized_Data2_500.xlsx | 413 | |
| result_500.docx | 593 | |
| result_Aspose_Words_Document_Save_asPDF_500.pdf | 34,535 | 99 seconds |
| result_500_Word_PrintToPDF.pdf | 55,738 | |
| result_500_Word_SaveAs_PDF.pdf | 65,773 | |
| Anonymized_Data2_3292.xlsx | 2,080 | |
| result_3292.docx | 3,661 | |
| result_Aspose_Words_Document_Save_asPDF_3292.pdf | 223,782 | 692 seconds (which is on the low end over multiple tests) |
| result_3292_Word_PrintToPDF.pdf | Unknown | |
| result_3292_Word_SaveAs_PDF.pdf | Unknown | |
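
The growth factors quoted above follow directly from the DOCX and Aspose PDF sizes in the list. A minimal sketch of that arithmetic, where the class and method names are made up for illustration:

```csharp
using System;

// Hypothetical helper: ratio between the PDF size and the intermediate DOCX size.
public static class SizeGrowth
{
    // Both sizes must be in the same unit (KB here); returns pdf / docx.
    public static double Factor(double docxKb, double pdfKb)
    {
        if (docxKb <= 0) throw new ArgumentOutOfRangeException(nameof(docxKb));
        return pdfKb / docxKb;
    }
}
```

With the figures from the list, SizeGrowth.Factor(30, 510) gives 17 for the 10-row file, SizeGrowth.Factor(593, 34535) about 58.2 for the 500-row file, and SizeGrowth.Factor(3661, 223782) about 61.1 for the complete sheet.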

If you want to investigate the resulting file size and runtime, you can simply run the following code. Our tests have been conducted in .NET Framework 4.8 using the latest Aspose.Words version 2025.03.

using Aspose.Words;

// Loads the DOCX saved after step 2 and performs step 4 of the pipeline (save as PDF).
// "result_*" stands for one of the attached files: result_10, result_500 or result_3292.
string path = @"C:\Testdata\";
Document wordDocument = new Document(path + "result_*.docx");
wordDocument.Save(path + "result_*_Aspose_Words_Document_Save_asPDF.pdf", SaveFormat.Pdf);

Testdata.zip (5.6 MB)

Due to the exorbitant file sizes, I’ve not been able to attach all the mentioned PDF files, but I’m fairly certain you’ll be able to recreate them by running the above code.

Besides the fact that (for the complete document) it takes approx. 15 minutes just to generate/save the PDF, and besides its enormous size of over 220 MB, this process also grabs all the available RAM that it can allocate. On my box, this resulted in over 16 GB being used by the conversion process alone! Putting all the sizes into perspective again:

  • original word document: < 4 MB
  • resulting PDF: ~ 224 MB
  • RAM used: > 16,000 MB
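
For anyone who wants to reproduce the runtime and memory figures, here is a rough sketch of one way to capture them. It is not how the numbers above were obtained; ConversionProbe and its members are hypothetical names, and only Stopwatch and Process.PeakWorkingSet64 are real .NET APIs.

```csharp
using System;
using System.Diagnostics;

// Hypothetical measuring helper, independent of Aspose.Words.
public static class ConversionProbe
{
    public struct Result
    {
        public TimeSpan Elapsed;          // wall-clock time spent inside the action
        public long PeakWorkingSetBytes;  // process-wide high-water mark, not just the action
    }

    // Runs the action once, then reads the peak working set of the current process.
    public static Result Measure(Action action)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));

        Stopwatch watch = Stopwatch.StartNew();
        action();
        watch.Stop();

        using (Process current = Process.GetCurrentProcess())
        {
            return new Result
            {
                Elapsed = watch.Elapsed,
                PeakWorkingSetBytes = current.PeakWorkingSet64
            };
        }
    }
}
```

Wrapping the save call would look like ConversionProbe.Measure(() => wordDocument.Save(pdfPath, SaveFormat.Pdf)), with pdfPath being whatever output path is used. Because the peak working set covers the whole process lifetime, it also includes the memory used to load the DOCX.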

Hopefully the results of your investigation can help improve the performance of your Aspose.Words component with regard to runtime, output file size and memory used.

Kind regards.

@M.Heinz Thank you for the detailed description of the problem. But I am afraid I do not think we have room for improvement here. Even according to your table, the PDF document produced by Aspose.Words is smaller than the PDF document produced by MS Word.
The biggest attached document cannot be converted to PDF using MS Word at all. It is simply too large. To process a document, Aspose.Words requires several times more memory than the original document size. Please see our documentation to learn more:
https://docs.aspose.com/words/net/memory-requirements/

In your case the input document is DOCX. As you may know, it is a ZIP archive with XML inside. If you unzip it, document.xml is about 150 MB, which is enormous for a TXT/XML file. The PDF format is less compact than DOCX, so it is expected that the generated PDF file is bigger than the original file.
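
The unpacked size can be checked without extracting anything, since a DOCX file is an ordinary ZIP archive. A small sketch using the standard System.IO.Compression classes; DocxInspector and ListParts are hypothetical names chosen for this illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper that lists the parts of a DOCX package with their sizes.
public static class DocxInspector
{
    // Returns (part name, compressed bytes, uncompressed bytes) for every ZIP entry.
    // The caller keeps ownership of the stream (leaveOpen: true).
    public static List<Tuple<string, long, long>> ListParts(Stream docxStream)
    {
        var parts = new List<Tuple<string, long, long>>();
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                parts.Add(Tuple.Create(entry.FullName, entry.CompressedLength, entry.Length));
            }
        }
        return parts;
    }
}
```

Opening the file with File.OpenRead and passing the stream to ListParts should show word/document.xml with a small compressed size and a much larger uncompressed one. On .NET Framework 4.8 this needs a reference to the System.IO.Compression assembly.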

Converting the biggest document you have attached to PDF takes over 11 minutes on my side and produces an enormous PDF document with 53930 pages, which cannot be handled by Adobe Acrobat Pro on my side. And I am afraid no PDF viewer will be able to process such a monstrous document.

So, I can only suggest that you avoid using such large documents.

Adobe Acrobat Reader is in fact able to open and display said large document, albeit with a little bit of lag.

I’m aware that DOCX is a zipped file format, and I’ve noticed the file size of >150 MB when unpacked, but I would have expected PDF to also be able to compress content. I might be wrong there, though.
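
On the compression question: Aspose.Words exposes a few save options that relate to output size and memory use. The sketch below only shows where they are set. Nobody in this thread has tested whether they change anything for this particular document, and to my knowledge Flate text compression is already the default. PdfExportSketch and SaveWithTunedOptions are hypothetical names.

```csharp
using Aspose.Words;
using Aspose.Words.Saving;

// Hypothetical wrapper; the effect on the documents in this thread is untested.
public static class PdfExportSketch
{
    public static void SaveWithTunedOptions(string docxPath, string pdfPath)
    {
        Document doc = new Document(docxPath);

        PdfSaveOptions options = new PdfSaveOptions
        {
            // Compress text streams in the PDF (assumed to be the default already).
            TextCompression = PdfTextCompression.Flate,
            // Embed only the glyph subsets that are used instead of full font files.
            EmbedFullFonts = false,
            // Trade speed for lower memory use while saving.
            MemoryOptimization = true
        };

        doc.Save(pdfPath, options);
    }
}
```

MemoryOptimization may make the save slower, so it is worth timing both variants before adopting it.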

And again: as relayed to your colleague @amjad.sahi in the Aspose.Cells forum, and only because they encouraged me to, I’ve submitted this report so that you can analyze your potential for improvement. You might as well perform the analysis on the smaller documents, if you’re inclined to do so.

I’m certainly not expecting you to be able to display 57 columns of data in a word table on a single page (and especially not in combination with approx. 3300 rows).

Internally, we’ve already moved on from this specific document as we deem it impractical for the aforementioned processing pipeline - although some of our clients might think otherwise.

@M.Heinz MS Word documents themselves have limitations. The maximum number of columns in an MS Word table is 63, and the maximum page size is limited to 1584pt.
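
To put those two limits next to the 57-column table discussed in this thread, here is a small sketch of the arithmetic. WordTableLimits is a hypothetical helper, and the calculation ignores page margins and cell padding.

```csharp
using System;

// Hypothetical helper built on the two limits quoted above.
public static class WordTableLimits
{
    public const int MaxColumns = 63;             // maximum columns in an MS Word table
    public const double MaxPageSizePt = 1584.0;   // maximum page size; 72 pt = 1 inch, so 22 inches

    // True if the column count is within the MS Word table limit.
    public static bool ColumnCountAllowed(int columns)
    {
        return columns >= 1 && columns <= MaxColumns;
    }

    // Average column width if the table spanned the full maximum page width.
    public static double MaxAverageColumnWidthPt(int columns)
    {
        if (!ColumnCountAllowed(columns)) throw new ArgumentOutOfRangeException(nameof(columns));
        return MaxPageSizePt / columns;
    }
}
```

For 57 columns, the count is allowed, and even on the widest possible page each column would average roughly 27.8 pt, which is about 0.39 inch.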