Low performance when saving a document to PDF format with the Aspose.Words Java library

zjyin · August 8, 2011, 2:47am

Hi Aspose,

I tested a simple Java application that creates Word and PDF documents of different sizes with the same program and the same Aspose.Words Java library. To my surprise, the largest Word document in my test environment reached 170931 KB (30455 pages) in 2 minutes, while the largest PDF document reached only 6919 KB (4667 pages) and took about 23 minutes. In my test, I set the max heap size to 1024M. So what happens with the PDF output? Why is PDF output so slow?

Please see the code below:

License license = new License();

try
{
    int count = Integer.valueOf(args[0]);
    int type = 1;

    if (args.length > 1)
    {
        type = Integer.valueOf(args[1]);
    }
    // specify the exact class to be used to allow for the WordCOMDriver
    // to be inherited
    InputStream is = new FileInputStream("Aspose.Words.lic");
    license.setLicense(is);

    DocumentBuilder builder = new DocumentBuilder();
    builder.writeln();
    builder.getParagraphFormat().clearFormatting();

    Style style = builder.getDocument().getStyles().get("Normal");
    if (style != null)
    {
        builder.getParagraphFormat().setStyle(style);
    }
    builder.clearRunAttrs();

    for (int i = 0; i < count; i++)
    {
        builder.writeln("This year, the CDL Performance Engineering Community will share a series of good articles we’ve found useful to you each month. Every articles we shared has been carefully selected and studied. For the following two months, we will start from DB2 and Agile Performance Engineering topics. We have also created a forum topic for each article for you to share your impressions and questions, so feel free to click on the relevant forum link and join the discussion! Wish you all Happy reading!");
        builder.writeln("This year, the CDL Performance Engineering Community will share a series of good articles we’ve found useful to you each month. Every articles we shared has been carefully selected and studied. For the following two months, we will start from DB2 and Agile Performance Engineering topics. We have also created a forum topic for each article for you to share your impressions and questions, so feel free to click on the relevant forum link and join the discussion! Wish you all Happy reading!");
    }

    // Backslashes must be escaped in Java string literals ("\t" would be a tab).
    if (type == 2)
    {
        builder.getDocument().save("c:\\test_pdf.doc");
    }
    else
    {
        builder.getDocument().save("c:\\test_pdf.pdf");
    }
}
catch (Exception e)
{
    e.printStackTrace();
}

Thanks & regards.

AndreyN · August 8, 2011, 3:08am

Hello
Thanks for your request. Processing time and memory consumption depend entirely on your documents and their complexity.
Usually, Aspose.Words needs about 10 times more memory than the original document size to build a DOM in memory. But this depends on the input file format and document complexity, as I already mentioned.
Processing time also depends on what you do with the documents. If you simply open and save a document, processing will be very fast. But if you need to perform complex operations such as rendering, processing will take longer. For example, Aspose.Words renders about 10 pages per second.
Best regards,
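
As a rough check of the figure above, the quoted rate can be turned into an expected render time. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and it assumes the rate of about 10 pages per second stays constant regardless of document size.

```java
public class RenderTimeEstimate {
    // Estimated seconds to render `pages` at a steady `pagesPerSecond` rate.
    static double estimateSeconds(long pages, double pagesPerSecond) {
        if (pagesPerSecond <= 0) {
            throw new IllegalArgumentException("pagesPerSecond must be positive");
        }
        return pages / pagesPerSecond;
    }

    public static void main(String[] args) {
        // 4667 pages is the PDF size reported in the first post.
        double seconds = estimateSeconds(4667, 10.0);
        System.out.printf("Estimated: %.1f s (about %.1f min)%n", seconds, seconds / 60.0);
    }
}
```

At that rate, 4667 pages would take a little under 8 minutes, so the reported 23 minutes is roughly three times slower than the quoted throughput would suggest.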

zjyin · August 8, 2011, 3:50am

In my code example, I just created a document with a very simple structure using:

builder.writeln("xxx");
builder.writeln("yyyy");

And it did not open any existing document.

My question is why saving to PDF needs more time and memory than saving to Word, even though the document has the same content. Please clarify.

Thanks.

alexey.noskov · August 8, 2011, 5:29am

Hi
Thanks for your request. As you may know, an MS Word document is a flow document, which means that it does not contain any information about its layout into pages.
When exporting to PDF or to any other fixed-page format (XPS, SWF, image), Aspose.Words needs to lay out the document into pages. This operation consumes a lot of time and memory. That is why saving the same document as a Word document is much faster than exporting it to PDF.
Best regards,
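
To illustrate the difference between a flow document and a fixed-page format, here is a toy paginator. It is not how Aspose.Words lays out documents; it is a minimal sketch that assumes fixed-width characters, greedy word wrapping, and a fixed number of lines per page. The point is only that every paragraph has to be visited and broken into lines before the page count is known, which is work a flow format never has to do on save.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyPaginator {
    // Greedy word wrap: break one paragraph into lines of at most maxChars
    // characters (a single word longer than maxChars gets a line of its own).
    static List<String> wrap(String paragraph, int maxChars) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : paragraph.trim().split("\\s+")) {
            if (line.length() > 0 && line.length() + 1 + word.length() > maxChars) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        lines.add(line.toString()); // last line (empty for an empty paragraph)
        return lines;
    }

    // Number of pages needed when every page holds linesPerPage lines.
    static int countPages(List<String> paragraphs, int maxChars, int linesPerPage) {
        int totalLines = 0;
        for (String p : paragraphs) {
            totalLines += wrap(p, maxChars).size();
        }
        return (totalLines + linesPerPage - 1) / linesPerPage; // ceiling division
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("aaa bbb ccc", "ddd");
        System.out.println(countPages(doc, 7, 2) + " page(s)");
    }
}
```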

dragos.cojocari · August 8, 2011, 7:50am

Hi Alexey,

thanks for the details, but the differences between the previous and current PDF capabilities are still quite large:
- Aspose.PDF 2.4 could save plain-text PDF documents of 30,000 pages
- Aspose.Words 10.03 runs out of memory (OOM) with plain-text PDF documents of ~4,500 pages

I understand there are major differences between Aspose.PDF 2.4 and Aspose.Words 10.03, including the PDF version being used (1.4 vs 1.5), yet the differences are too big, almost an order of magnitude.

It can also be argued that a PDF document with 30,000 pages is a bit on the extreme side, but it is still perfectly legitimate in some cases. A 4k-page document, on the other hand, is not uncommon. And we are talking about plain text here: no images, tables, or other elements that I would expect to be more expensive memory-wise.

In general, memory consumption seems much higher with the new driver. It looks like writing plain-text documents of 2k-3k pages gets very close to the 1GB heap mark, and the save time starts increasing at a pace disproportionate to the number of pages.

With these limitations, Aspose.Words 10.03 is unusable for PDF output for anything above 2,000 pages, due to the impact on resources and the possibility of OOM crashes. If you put this in the context of a server application, with multiple documents created at the same time, the size of the documents that can be reliably generated decreases further.

Your guidance and help in enhancing the scalability of the PDF export would be highly appreciated. We have been using the 2.x versions of the Word and PDF drivers very successfully (kudos for creating a great set of libraries) and have been eagerly awaiting an Aspose.Words for Java that can save multiple formats from the same memory model.

Regards,
Dragos

zjyin · August 8, 2011, 7:52am

Thanks for your clarification. Are there any ways or plans to reduce the memory and time needed for PDF output from Aspose.Words?

Thanks.

alexey.noskov · August 8, 2011, 9:06am

Hi
Thanks for your request. We are always working on improving performance. But as you can understand, rendering is quite a complex task, and it will always be slower than simply saving to flow formats.
Best regards,

dragos.cojocari · August 8, 2011, 9:24am

Hey Alexey,

in this case the slowness is caused not by the complexity of the model or the format but by the heap being exhausted, so the GC has to kick in to try to free more memory. As long as memory consumption stays below the heap size, performance is very good.

Any feedback on the very large difference in capability between Aspose.PDF 2.4 and Aspose.Words 10.03 for saving PDF?

Regards,
Dragos
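
One way to check whether GC pressure, rather than layout work, dominates the save time is to measure wall time and accumulated GC time around the operation. The sketch below uses only the standard `Runtime` and `GarbageCollectorMXBean` APIs; the `HeapProbe` class and its methods are illustrative names, and the workload in `main` is a placeholder standing in for the document save call.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class HeapProbe {
    // Heap currently in use, in bytes.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Total milliseconds all collectors have spent so far
    // (collectors that report -1, meaning "undefined", are skipped).
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Runs the task and returns { elapsed ms, GC ms during the task, used heap bytes afterwards }.
    static long[] measure(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long gcMs = totalGcMillis() - gcBefore;
        return new long[] { elapsedMs, gcMs, usedHeapBytes() };
    }

    public static void main(String[] args) {
        // Placeholder workload: allocate and drop 16 MB. In the scenario discussed
        // above, this Runnable would wrap the document save call instead.
        long[] r = measure(() -> {
            for (int i = 0; i < 16; i++) {
                byte[] block = new byte[1 << 20];
                block[0] = 1; // touch the array so the allocation is actually used
            }
        });
        System.out.printf("elapsed=%d ms, gc=%d ms, used heap=%d MB, max heap=%d MB%n",
                r[0], r[1], r[2] >> 20, Runtime.getRuntime().maxMemory() >> 20);
    }
}
```

If the GC time is a large fraction of the elapsed time and the used heap sits close to the maximum, that would support the explanation that the slowdown is a side effect of memory exhaustion.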

alexey.noskov · August 8, 2011, 9:38am

Hi
Thank you for the additional information. Memory usage also depends on your document and the operation you are performing on it. When you open or build a document, Aspose.Words creates a DOM in memory. When you render the document, Aspose.Words also has to create an APS (Aspose Page Specification) model in memory. So memory usage will increase when you render the document.
Regarding Aspose.Pdf and Aspose.Words: Aspose.Words is designed to work with flow documents (MS Word documents). So when you save the created document to PDF, Aspose.Words needs to lay it out into pages. On the other hand, Aspose.Pdf is designed to work with PDF documents. When you create a document, you build it page by page. So there is no need to lay out the document; the model is simply written to the PDF file.
Best regards,

dragos.cojocari · August 8, 2011, 9:53am

Hi Alexey,

thanks for the details. Does this mean that Aspose.Words should not be used for PDF?

Regards,
Dragos

AndreyN · August 8, 2011, 10:02am

Hello
Thanks for your inquiry. Aspose.Words for Java is a class library that enables your applications to perform a great range of document processing tasks. Aspose.Words supports DOC, DOCX, RTF, HTML, OpenDocument, PDF, XPS, EPUB and other formats. With Aspose.Words you can generate, modify, convert and render documents without utilizing Microsoft Word®.
Aspose.Words for Java features can be divided into four main areas:
· Conversions. High quality conversions to and from DOC, OOXML, RTF, WordprocessingML, HTML, MHTML, TXT and OpenDocument formats.
· Document Object Model. Programmatic access through a rich API to all document elements and formatting lets you create, modify, extract, copy, split, join, and replace document content.
· Rendering. Convert whole documents or pages to PDF, XPS or SWF for server-side document generation. Also convert document pages to PNG or BMP images. All with high fidelity, exactly as Microsoft Word® would have done it.
· Reporting. Generate documents or reports from scratch or by filling templates with data from data sources or business objects.
Please see the following link to learn more:
https://docs.aspose.com/words/java/product-overview/
Best regards,

dragos.cojocari · August 8, 2011, 10:15am

Hey Andrey,

thanks for the details. This was the very reason for us moving to Aspose.Words for both our Word and PDF outputs. And while the quality of the PDF documents is excellent, the library’s scalability when saving PDF is a big blocker.

Regards,
Dragos

dragos.cojocari · August 8, 2011, 10:17am

Hey Andrey and Alexey,

given all that was said above, what are your recommendations for solving this issue, and do you have plans/solutions to improve the scalability of the Aspose.Words library for saving PDF?

Regards,
Dragos

AndreyN · August 8, 2011, 10:32am

Hi
Thanks for your inquiry. As Alexey mentioned, we are always working on improving performance. But rendering will always be slower than simply saving to flow formats; this is expected behavior.
Best regards,

dragos.cojocari · August 8, 2011, 12:53pm

Hey Andrey,

thanks, but that doesn’t answer the question: how can I reliably save PDF documents, and how can I avoid the OOM? This is not a performance issue; the slowness is a side effect of memory being depleted. As I mentioned before, performance is very good until the memory is depleted (the same goes for Word, but at much higher page counts).

So to make the question clearer: can I use Aspose.Words 10.03 to generate a 10,000-page PDF on 32-bit? Will I be able to do this in Aspose.Words 10.0x, or am I looking at the wrong library here and should use something else? Do you know if the .NET version exhibits the same limitations (I’m kind of hoping this is a porting issue)?

Appreciate all your help and patience so far.

Regards,
Dragos

PS: a similar issue was reported here

AndreyN · August 8, 2011, 1:15pm

Hi Dragos,
Thanks for your request. There is no limit on the number of pages or the document size. The only limit is the amount of memory available on your side.
But note that when you open a document using Aspose.Words, a document object model is built in memory. The DOM always takes several times more memory than the original document size.
So maybe your document is too huge. I think, in your case, it is better to use a few smaller documents instead of one huge document. You can try converting your Word documents to PDF using Aspose.Words and then using Aspose.Pdf.Kit to concatenate the PDF documents. Please see the following link for more information:
https://docs.aspose.com/pdf/net/concatenate-pdf-documents/
By the way, even MS Word does not like such large documents.
Best regards,
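
The splitting idea above can be sketched independently of any particular library. The following is a minimal sketch that only plans the chunks: it assumes the content is a sequence of `totalItems` paragraphs (or loop iterations, as in the test program) and that each chunk would be rendered to its own PDF and concatenated afterwards. `ChunkPlanner` is a hypothetical helper; the rendering and concatenation calls are deliberately left out because they depend on the library version in use.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlanner {
    // Splits the index range [0, totalItems) into consecutive half-open
    // ranges {start, end} of at most chunkSize items each.
    static List<int[]> plan(int totalItems, int chunkSize) {
        if (totalItems < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("totalItems must be >= 0 and chunkSize > 0");
        }
        List<int[]> chunks = new ArrayList<>();
        for (int start = 0; start < totalItems; start += chunkSize) {
            int end = Math.min(start + chunkSize, totalItems);
            chunks.add(new int[] { start, end });
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Example: 10 items in chunks of at most 4.
        for (int[] c : plan(10, 4)) {
            // Here each chunk would be built as its own document and saved
            // to a separate PDF before the parts are concatenated.
            System.out.println("chunk [" + c[0] + ", " + c[1] + ")");
        }
    }
}
```

Because each chunk is laid out on its own, only one chunk's model needs to be in memory at a time; whether page numbering, headers, and tables of contents survive the concatenation is a separate question.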

dragos.cojocari · August 8, 2011, 3:19pm

Hey Andrey,

>>> There is no limit on the number of pages or the document size. The only limit is the amount of memory available on your side.
Which on a 32-bit JRE is around 1GB on most systems, hence the current predicament. But even in a 64-bit environment with more than 1GB of memory, the amount of memory consumed is quite disproportionate to the result.

>>> The DOM always takes several times more memory than the original document size.
Understood and agreed. But in this case we are talking about ~200 times more memory (1GB+ of memory for a 5MB file).

>>> By the way, even MS Word does not like such large documents.

Yes, but I would not qualify a 5k pages doc as huge.

>>> So maybe your document is too huge.
Not quite. In some domains you need to create such large documents for various purposes, one of the most commonly cited being audits. And the problem is that the old PDF driver could handle larger models with no problems.

>>> I think, in your case, it is better to use a few smaller documents instead of one huge document. You can try converting your Word documents to PDF using Aspose.Words and then using Aspose.Pdf.Kit to concatenate the PDF documents.
Thanks for the idea, it is very interesting. Can the split be mid-page, and can Pdf.Kit reassemble the document as it would have been generated in a single pass, or will it introduce page breaks/section breaks, lose the header/footer, etc.? What about memory consumption? Is it safe to assume that the memory footprint is similar to that of the Aspose.PDF driver, given that you already have the page information?

LE: forgot a big question. Are page numbering and TOC/TOT/TOF supported by this technique?

Regards,
Dragos

adam.skelton · August 8, 2011, 7:26pm

Hi Dragos,
Thanks for this additional information.
The straight answer is that you most likely can’t achieve what you are looking for in a 32-bit environment using Aspose.Words. Loading such massive documents into memory and laying them out for rendering will always take a large amount of memory. This is unavoidable, as the document elements are loaded into memory in complete detail. This allows for full modification of any part of the document and high-fidelity conversions to many other formats.
Aspose.Words is used to faithfully render Word documents with complex layouts and elements to PDF. If your documents only contain plain text, then I think it would be suitable for you to continue using the 2.x versions of the libraries for your needs.
Thanks,

dragos.cojocari · August 9, 2011, 3:14am

Hey Aske,

our documents contain much more than plain text: images, OLEs, tables, hyperlinks, bookmarks, comments, page headers/footers, rich formatting etc. In this test we have used just plain text to give you guys a very simple way to reproduce and identify the problem.

Regards,
Dragos

AndreyN · August 9, 2011, 5:52am

Hello
Thanks for your inquiry. Let me explain why Aspose.Words uses more memory than the document size. After a document is loaded into memory, it is stored in a DOM (Document Object Model). If the document contains mostly text, Aspose.Words requires approximately 40 times more memory than the original DOCX file size (10 times more memory than the DOC file size). So if your DOCX document is 20MB, you need 800MB of memory to load it. Then, when you save the document to PDF, Aspose.Words needs to build the layout of the document, which is also stored in memory. So I think that to convert such a huge document to PDF you need approximately 2GB of available memory. This is expected behavior.
Best regards,
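
The rule of thumb above can be written out as a small calculation. This sketch takes the factors of 40 (DOCX) and 10 (DOC) directly from the post; they are rough figures for mostly-text documents, not measured constants, and the class and method names are made up for illustration. The extra memory needed for the page layout during PDF export is not included.

```java
public class MemoryEstimate {
    // Rough multipliers quoted in the post for mostly-text documents.
    static final int DOCX_FACTOR = 40;
    static final int DOC_FACTOR = 10;

    // Estimated memory (in MB) needed to hold the DOM for a file of the given size.
    static long estimateDomMegabytes(long fileSizeMb, boolean isDocx) {
        if (fileSizeMb < 0) {
            throw new IllegalArgumentException("fileSizeMb must be non-negative");
        }
        return fileSizeMb * (isDocx ? DOCX_FACTOR : DOC_FACTOR);
    }

    public static void main(String[] args) {
        // The 20MB DOCX example from the post.
        System.out.println(estimateDomMegabytes(20, true) + " MB for the DOM");
    }
}
```

For the 20MB DOCX example this gives 800MB for the DOM alone, which already approaches the 1024M heap used in the original test before any layout is built.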