We are migrating one of our legacy .NET services to Java and have encountered a significant decrease in performance on a flow that involves numerous calls to the LayoutCollector.getStartPageIndex method.
Unfortunately, there is no way to reduce the number of calls, as we need to identify the page number on which each paragraph in an OOXML document occurs.
Running the code below takes more than 6 seconds for a 2 MB test file (attached) with version 17.3.0 of aspose-words. With 16.4.0, our current production version, it’s even slower: the same code takes more than 8 seconds.
public Map<Integer, Node> makeProcessingMap(final Document document) throws Exception {
    final Map<Integer, Node> processingMap = new HashMap<>();
    final LayoutCollector collector = new LayoutCollector(document);
    final NodeCollection nodes = document.getChildNodes(NodeType.PARAGRAPH, true);
    for (final Node node : nodes) {
        int startPageIndex = collector.getStartPageIndex(node);
        processingMap.put(startPageIndex, node);
    }
    return processingMap;
}
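A side note on the snippet above: because the map is keyed by page index, `put` replaces the previous entry whenever several paragraphs start on the same page, so only the last paragraph per page is kept. If every paragraph is needed, one option is to collect a list per page. The following is a minimal sketch under that assumption; `PageGrouping`, `PageLookup`, and `groupByPage` are hypothetical names introduced for illustration, not part of Aspose.Words, and the demo uses strings in place of nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PageGrouping {

    /** Hypothetical lookup that may throw, like a page-index query on a layout. */
    public interface PageLookup<T> {
        int pageOf(T item) throws Exception;
    }

    /** Groups items by page index, keeping every item that starts on a page. */
    public static <T> Map<Integer, List<T>> groupByPage(Iterable<T> items, PageLookup<T> lookup)
            throws Exception {
        // TreeMap keeps the pages in ascending order.
        Map<Integer, List<T>> byPage = new TreeMap<>();
        for (T item : items) {
            int page = lookup.pageOf(item);
            List<T> onPage = byPage.get(page);
            if (onPage == null) {
                onPage = new ArrayList<>();
                byPage.put(page, onPage);
            }
            onPage.add(item);
        }
        return byPage;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in data: the string length plays the role of the page index.
        List<String> items = Arrays.asList("a", "bb", "cc", "ddd");
        System.out.println(groupByPage(items, s -> s.length()));
        // prints {1=[a], 2=[bb, cc], 3=[ddd]}
    }
}
```

With the types from the question, the call would presumably look like `groupByPage(nodes, collector::getStartPageIndex)`. This changes what the map contains, not how long the layout takes.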
The legacy .NET code was less optimised and made many more calls to the GetStartPageIndex method; even so, its overall performance was significantly better (the Java implementation takes almost twice as long to process the same file).
We need to be able to process files that are much larger than this 2MB sample, but can’t get around this performance issue.
How can we overcome this performance gap between the Java and .NET implementations?
In this case, Aspose.Words needs to build a ‘page layout’ of the document internally. As a rough guide, Aspose.Words lays out about 10 pages per second, so the extra time it takes to format a document into pages depends on the number of pages in your Word document. Please also note that this relationship is not linear: building the layout of a single page may take a minute, while 100 pages may take only a few seconds. Put simply, the processing time and memory usage depend entirely on your documents and their complexity.
We have tested the following code on Windows 10:
// Document load
// Backslashes must be escaped in a Java string literal.
Document doc = new Document("D:\\temp\\docx_2mb.docx");
// Rest of the code
LayoutCollector collector = new LayoutCollector(doc);
NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Node node : nodes)
{
    // Page on which this paragraph starts
    int startPageIndex = collector.getStartPageIndex(node);
    System.out.println(startPageIndex);
}
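The two phases marked by the comments above ("Document load" and "Rest of the code") can be timed separately. A minimal sketch of one way to do this, assuming a hypothetical `PhaseTimer` helper that is not part of Aspose.Words; the demo times a short sleep in place of the real work:

```java
public class PhaseTimer {

    /** Hypothetical task type that is allowed to throw checked exceptions. */
    public interface ThrowingRunnable {
        void run() throws Exception;
    }

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(ThrowingRunnable task) throws Exception {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; in the test above this would be the document load
        // or the loop over the paragraphs.
        long elapsed = timeMillis(() -> Thread.sleep(50));
        System.out.println("Elapsed (ms): " + elapsed);
    }
}
```

Single-run timings on the JVM include class loading and JIT warm-up, so averaging several runs gives a fairer comparison.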
We observed the following readings on .NET Framework 4.6 and Java 8. Your Word document has 308 pages, and the difference of roughly 3.5 seconds in average total time between the .NET and Java platforms looks acceptable.
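Readings (all times in ms; both products are version 17.3)

| Reading   | .NET: Document Load | .NET: Rest of the Code | Java: Document Load | Java: Rest of the Code |
|-----------|---------------------|------------------------|---------------------|------------------------|
| Reading 1 | 2003                | 13208                  | 3439                | 16807                  |
| Reading 2 | 2083                | 11014                  | 3799                | 12728                  |
| Reading 3 | 2230                | 11833                  | 3598                | 12615                  |
| Average   | 2105.33             | 12018.33               | 3612                | 14050                  |

Total (Avg), load plus rest of the code: 14123.66 for Aspose.Words for .NET and 17662 for Aspose.Words for Java.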
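To put these figures in proportion, the sketch below recomputes the averages from the three readings above and derives the absolute and relative gap between the two totals. The class name `ReadingStats` is just an illustration.

```java
public class ReadingStats {

    /** Arithmetic mean of the given readings (in ms). */
    public static double average(long... values) {
        double sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Readings taken from the table: document load + rest of the code.
        double netTotal = average(2003, 2083, 2230) + average(13208, 11014, 11833);
        double javaTotal = average(3439, 3799, 3598) + average(16807, 12728, 12615);

        System.out.printf(".NET total (ms): %.2f%n", netTotal);
        System.out.printf("Java total (ms): %.2f%n", javaTotal);
        System.out.printf("Difference (ms): %.2f%n", javaTotal - netTotal);
        System.out.printf("Java / .NET ratio: %.3f%n", javaTotal / netTotal);
    }
}
```

On these readings the gap works out to about 3.5 seconds, which makes the Java run roughly 25% slower than the .NET run.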
Please let us know if we can be of any further assistance.