Performance issues with LayoutCollector.getStartPageIndex in Java

Hi,

We are migrating one of our legacy .NET services to Java and have encountered a significant decrease in performance on a flow that involves numerous calls to the LayoutCollector.getStartPageIndex method.

Unfortunately, there is no way to reduce the number of calls, as we need to identify the page number on which each paragraph in an OOXML document occurs.

Running the code below takes more than 6 seconds for a 2MB test file (attached) using version 17.3.0 of aspose-words; in 16.4.0, our current production version, the same code is even slower and takes more than 8 seconds:

public Map<Integer, Node> makeProcessingMap(final Document document) throws Exception {
    final Map<Integer, Node> processingMap = new HashMap<>();
    final LayoutCollector collector = new LayoutCollector(document);
    final NodeCollection nodes = document.getChildNodes(NodeType.PARAGRAPH, true);
    for (final Node node : nodes) {
        int startPageIndex = collector.getStartPageIndex(node);
        processingMap.put(startPageIndex, node);
    }
    return processingMap;
}
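One side note on the snippet above: because the map is keyed by page index, `HashMap.put` replaces the previous entry whenever two paragraphs start on the same page, so only the last paragraph per page survives. If every paragraph is needed, a grouping along the following lines may be closer to the intent. This is only a sketch that runs without Aspose.Words: `PageGrouping`, `PageLookup` and `groupByPage` are hypothetical names, and the lookup in `main` is a stand-in for `collector.getStartPageIndex(node)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PageGrouping {
    // Hypothetical callback; with Aspose.Words it would wrap collector.getStartPageIndex(node).
    // It declares a checked exception because the method in the post is declared "throws Exception".
    interface PageLookup<T> {
        int pageOf(T item) throws Exception;
    }

    // Groups items by page index, keeping document order within each page.
    static <T> Map<Integer, List<T>> groupByPage(Iterable<T> items, PageLookup<T> lookup)
            throws Exception {
        Map<Integer, List<T>> byPage = new TreeMap<>();
        for (T item : items) {
            int page = lookup.pageOf(item);
            byPage.computeIfAbsent(page, k -> new ArrayList<>()).add(item);
        }
        return byPage;
    }

    public static void main(String[] args) throws Exception {
        List<String> paragraphs = List.of("intro", "body-1", "body-2", "summary");
        // Stand-in lookup: pretend the first two paragraphs start on page 1, the rest on page 2.
        Map<Integer, List<String>> byPage =
                groupByPage(paragraphs, p -> paragraphs.indexOf(p) < 2 ? 1 : 2);
        System.out.println(byPage);
    }
}
```

With the real library the call would presumably look like `groupByPage(nodes, collector::getStartPageIndex)`. This changes only what is stored; it makes the same number of `getStartPageIndex` calls, so it does not address the timing itself.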

The legacy .NET code was less optimised and made many more calls to the GetStartPageIndex method; however, its overall performance was significantly better (the Java implementation takes almost twice as long to process the same file).

We need to be able to process files that are much larger than this 2MB sample, but can’t get around this performance issue.

How can we overcome this performance gap between the Java and .NET implementations?

Thank you,
Oana

Hi Oana,

Thanks for your inquiry.

In this case, Aspose.Words needs to build a ‘page layout’ of the document internally. As a rough guide, Aspose.Words lays out about 10 pages per second, so the extra time it takes to format a document into pages depends on how many pages your Word document has. Please also note that this relationship is not linear: building the layout of a single page may take a minute, while 100 pages may take only a few seconds. Put simply, processing time and memory usage depend entirely on your documents and their complexity.

We have tested the following code on Windows 10:

// Document load
Document doc = new Document("D:\\temp\\docx_2mb.docx");

// Rest of the code
LayoutCollector collector = new LayoutCollector(doc);

NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Node node : nodes)
{
    int startPageIndex = collector.getStartPageIndex(node);
    System.out.println(startPageIndex);
}
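The thread does not say how the two phases were timed. One possible way to separate "document load" from "rest of the code" is a small wrapper around `System.nanoTime()`, sketched below. `PhaseTimer` and its methods are hypothetical names, and the workloads in `main` are stand-ins so the sketch runs without Aspose.Words; the comments show where the calls from the code above would go.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

class PhaseTimer {
    // Elapsed wall-clock milliseconds per label, in the order the phases were first run.
    private final Map<String, Long> elapsedMs = new LinkedHashMap<>();

    // Runs the task, adds its duration to the label's total, and returns the task's result.
    <T> T time(String label, Callable<T> task) throws Exception {
        long start = System.nanoTime();
        try {
            return task.call();
        } finally {
            elapsedMs.merge(label, (System.nanoTime() - start) / 1_000_000, Long::sum);
        }
    }

    Map<String, Long> readings() {
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        PhaseTimer timer = new PhaseTimer();
        // With Aspose.Words on the classpath the phases would be, roughly:
        //   Document doc = timer.time("load", () -> new Document(path));
        //   timer.time("rest", () -> { /* LayoutCollector loop from above */ return null; });
        // Stand-in workloads keep this sketch runnable without the library.
        String loaded = timer.time("load", () -> "document");
        timer.time("rest", () -> {
            Thread.sleep(20);
            return loaded.length();
        });
        System.out.println(timer.readings());
    }
}
```

Timing the first `getStartPageIndex` call separately from the remaining calls would show how much of the "rest" figure is the one-off layout build described above and how much is the per-paragraph lookup. A single cold run on the JVM also includes class loading and JIT warm-up, which is one reason to take several readings.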

We have observed the following readings on .NET Framework 4.6 and Java 8. Your Word document has 308 pages, and a difference of around three seconds between the .NET and Java platforms looks acceptable.

Readings       Aspose.Words for .NET (17.3)                 Aspose.Words for Java (17.3)
               Document Load (ms)   Rest of the Code (ms)   Document Load (ms)   Rest of the Code (ms)
Reading 1      2003                 13208                   3439                 16807
Reading 2      2083                 11014                   3799                 12728
Reading 3      2230                 11833                   3598                 12615
Average        2105.33              12018.33                3612                 14050
Total (Avg)    14123.66                                     17662
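For anyone who wants to check the arithmetic, the averages and totals can be recomputed from the three readings with a few lines of Java. `ReadingsSummary` and `average` are illustrative names only; the numbers are copied from the table.

```java
import java.util.stream.LongStream;

class ReadingsSummary {
    // Arithmetic mean of the given millisecond readings.
    static double average(long... ms) {
        return LongStream.of(ms).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double netLoad = average(2003, 2083, 2230);
        double netRest = average(13208, 11014, 11833);
        double javaLoad = average(3439, 3799, 3598);
        double javaRest = average(16807, 12728, 12615);

        double netTotal = netLoad + netRest;
        double javaTotal = javaLoad + javaRest;

        System.out.printf("NET  load %.2f  rest %.2f  total %.2f%n", netLoad, netRest, netTotal);
        System.out.printf("Java load %.2f  rest %.2f  total %.2f%n", javaLoad, javaRest, javaTotal);
        System.out.printf("Gap %.2f ms, Java/.NET ratio %.3f%n",
                javaTotal - netTotal, javaTotal / netTotal);
    }
}
```

On these readings the average gap is about 3.5 seconds, so Java is roughly 25% slower overall. Split by phase, document load is about 72% slower on Java, while the layout loop is about 17% slower.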

Please let us know if we can be of any further assistance.

Best regards,

Hi Awais,

Thank you for your analysis! We’ll look further into it and see how we could improve the response time.

Have a nice day!
Oana.