Performance issues with LayoutCollector.getStartPageIndex in Java

omihai · March 20, 2017, 10:46am

Hi,

We are migrating one of our legacy .NET services to Java and have encountered a significant decrease in performance on a flow that involves numerous calls to the LayoutCollector.getStartPageIndex method.

Unfortunately, there is no way to reduce the number of calls, as we need to identify the page number on which each paragraph in an OOXML document occurs.

Running the code below takes more than 6 seconds for a 2MB test file (attached) using version 17.3.0 of aspose-words, while in 16.4.0, our current production version, it is even slower (the same code takes more than 8 seconds):

public Map<Integer, Node> makeProcessingMap(final Document document) throws Exception {
    final Map<Integer, Node> processingMap = new HashMap<>();
    final LayoutCollector collector = new LayoutCollector(document);
    final NodeCollection nodes = document.getChildNodes(NodeType.PARAGRAPH, true);
    for (final Node node : nodes) {
        int startPageIndex = collector.getStartPageIndex(node);
        processingMap.put(startPageIndex, node);
    }
    return processingMap;
}

The legacy .NET code was less optimised and made many more calls to the GetStartPageIndex method; however, its overall performance was significantly better (the Java implementation takes almost twice as long to process the same file).

We need to be able to process files that are much larger than this 2MB sample, but can’t get around this performance issue.

How can we overcome this performance gap between the Java and .NET implementations?

Thank you,
Oana

awais.hafeez · March 21, 2017, 6:30am

Hi Oana,

Thanks for your inquiry.

In this case, Aspose.Words needs to build a ‘page layout’ of the document internally. As a rough guide, Aspose.Words lays out about 10 pages per second, so the extra time it takes to format a document into pages depends on the number of pages in your Word document. Please also note that this relationship is not linear: a single page may take a minute to lay out, while 100 pages may take only a few seconds. Put simply, processing time and memory usage depend entirely on your documents and their complexity.

We tested the following code on Windows 10:

// Document load
Document doc = new Document("D:\\temp\\docx_2mb.docx"); // backslashes must be escaped in Java string literals

// Rest of the code
LayoutCollector collector = new LayoutCollector(doc);

NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Node node : nodes)
{
    int startPageIndex = collector.getStartPageIndex(node);
    System.out.println(startPageIndex);
}
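
For anyone who wants to reproduce the two timings separately ("Document Load" and "Rest of the Code"), a small stopwatch helper is one way to do it. This is only a minimal sketch using the standard library: the StageTimer class and its time method are hypothetical names, not part of Aspose.Words, and the workload in main is a stand-in for the document load and the LayoutCollector loop above.

```java
import java.util.concurrent.Callable;

class StageTimer {
    /** Value produced by a timed stage plus its elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs one stage and measures it with System.nanoTime(). */
    static <T> Timed<T> time(Callable<T> stage) {
        long start = System.nanoTime();
        T value;
        try {
            value = stage.call();
        } catch (Exception e) {
            // Wrap checked exceptions so callers do not need a throws clause.
            throw new RuntimeException(e);
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption). In a real measurement, stage 1 would
        // load the document and stage 2 would run the getStartPageIndex loop.
        Timed<Integer> stage = time(() -> {
            int sum = 0;
            for (int i = 0; i < 1000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.println("Stage took " + stage.elapsedMillis + " ms");
    }
}
```

Taking several readings per stage, as in the readings below, helps smooth out run-to-run variation.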

We observed the following readings on .NET Framework 4.6 and Java 8. Your Word document has 308 pages, and an average difference of about 3.5 seconds between the .NET and Java platforms looks acceptable.

Readings      Aspose.Words for .NET (17.3)                 Aspose.Words for Java (17.3)
              Document Load (ms)   Rest of the Code (ms)   Document Load (ms)   Rest of the Code (ms)
Reading 1     2003                 13208                   3439                 16807
Reading 2     2083                 11014                   3799                 12728
Reading 3     2230                 11833                   3598                 12615
Average       2105.33              12018.33                3612                 14050
Total (Avg)   14123.66                                     17662
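
To make the comparison explicit, the averages and the overall gap can be recomputed from the three readings. The snippet below is a sketch using only the figures from the table; the ReadingSummary class and its average method are hypothetical helper names. By this arithmetic the Java run is about 3.5 seconds slower in total, roughly 25% more than the .NET run.

```java
import java.util.Arrays;

class ReadingSummary {
    /** Arithmetic mean of the given readings in milliseconds. */
    static double average(long... readingsMs) {
        return Arrays.stream(readingsMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Readings copied from the table: document load + rest of the code.
        double netTotal = average(2003, 2083, 2230) + average(13208, 11014, 11833);
        double javaTotal = average(3439, 3799, 3598) + average(16807, 12728, 12615);

        System.out.printf("NET total (avg):  %.2f ms%n", netTotal);
        System.out.printf("Java total (avg): %.2f ms%n", javaTotal);
        System.out.printf("Difference:       %.2f ms%n", javaTotal - netTotal);
        System.out.printf("Java / .NET:      %.2f%n", javaTotal / netTotal);
    }
}
```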

Please let us know if we can be of any further assistance.

Best regards,

omihai · March 21, 2017, 8:55am

Hi Awais,

Thank you for your analysis! We’ll look further into it and see how we could improve the response time.

Have a nice day!
Oana.