Our team is now evaluating several tools to perform a task in our system and Aspose.Words is one of them.
The goal of this task is to extract HTML from a Word document one paragraph at a time.
Aspose.Words is doing the job but the performance when extracting as HTML is not great. For a simple docx file (3 pages) it takes around 15s to complete the task.
Extracting as text from the same docx file takes around 4s.
Here is the sample of code I’m talking about:
private void addContentToCurrentPage(Paragraph paragraph) {
    // ignore text until first heading style is present
    if (fCurrentPage == null) {
        return;
    }
    String content;
    try {
        /*
         HERE IS THE ISSUE!
         */
        content = paragraph.toString(SaveFormat.HTML);
    } catch (Exception ex) {
        System.out.println("Unable to parse paragraph. Message: " + ex.getMessage());
        content = "Import error placeholder";
    }
    DocumentModel.Fragment fragment = createHtmlFragment(content);
    fCurrentPage.addFragment(fragment);
}
After some debugging I’ve observed that the extraction takes longer for the paragraphs where processing throws the following warning: “DrawingML is not supported in Html format and will be converted to shape.” I don’t understand where it comes from, because my document doesn’t have any images.
Is there any load option or possibility to make it run faster?
I’ve attached the Java project containing the source code, with the document in resources. tc-aspose-evaluation.zip (126.4 KB)
We have tested the scenario using the latest version of Aspose.Words for Java 21.1 with the following code example and could not reproduce the issue you reported. So, please use Aspose.Words for Java 21.1.
Please note that performance and memory usage depend on the complexity and size of the documents you are generating.
Moreover, the first call of “new Document()” triggers the loading of all related classes and the instantiation of system buffers. The static Aspose.Words resources (document styles, fonts, border arts, etc.) are loaded lazily, only when they are really needed, and after loading they are cached for the rest of the session. So the second call of “new Document()” will not cause class loading. If your JRE uses JIT, the behavior is more complex because of the several levels of byte-code compilation and optimization.
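To illustrate the lazy-load-then-cache behavior described above, here is a minimal sketch using only the JDK. It does not use or reproduce any Aspose.Words internals: LazyResources, loadCount and expensiveLoad are hypothetical names standing in for a library that loads static resources on first use. The point is only that the first call pays the load cost and later calls hit the cache.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a library that loads static resources lazily
// and caches them for the rest of the session. Not an Aspose.Words class.
final class LazyResources {
    // Counts how often the expensive load actually ran.
    static final AtomicInteger loadCount = new AtomicInteger();
    private static volatile String cached;

    static String get() {
        String local = cached;
        if (local == null) {
            synchronized (LazyResources.class) {
                local = cached;
                if (local == null) {
                    loadCount.incrementAndGet();
                    local = expensiveLoad();
                    cached = local; // later calls return this cached value
                }
            }
        }
        return local;
    }

    // Simulates a costly one-time initialization (assumption: any CPU-bound work will do).
    private static String expensiveLoad() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append(i % 10);
        }
        return sb.toString();
    }
}

final class WarmUpDemo {
    // Returns the wall-clock duration of one run, in nanoseconds.
    static long timeNanos(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long first = timeNanos(LazyResources::get);
        long second = timeNanos(LazyResources::get);
        System.out.println("first call:  " + first + " ns");
        System.out.println("second call: " + second + " ns");
        System.out.println("loads performed: " + LazyResources.loadCount.get());
    }
}
```

When benchmarking, it is therefore usually worth discarding the first call (or a few warm-up iterations) before comparing timings.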
I’m using the latest Aspose.Words for Java (21.1) and the above code is also very slow. IMHO, the provided document from the archive is not complex: it has 3 pages with several paragraphs.
In VisualVM I can see that most of the CPU time is spent in the com.aspose.words.zz3L.visitShapeStart() method. You can see this in the attached screenshot. visualvm-cpu-time.png (92.9 KB)
My questions are:
Why is the above method called if I don’t have any images in the document?
Is there any way to bypass it or to reduce the execution time of the paragraph.toString(SaveFormat.HTML) call?
I don’t know if it is relevant, but the Java application is deployed in a WildFly 18 application server which runs in a Docker container based on Debian Stretch. Java version: 1.8
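For anyone who wants to narrow down which paragraphs are slow without attaching a profiler, a rough sketch of a per-item timing wrapper follows. It uses only the JDK and is Java 8 compatible. ConversionTimer, Sample and the fake converter are hypothetical names introduced for illustration; in the real code the converter would wrap the paragraph.toString(SaveFormat.HTML) call together with its try/catch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: records how long each conversion took so that
// the slowest items can be inspected afterwards.
final class ConversionTimer<T> {
    static final class Sample<T> {
        final T item;
        final String output;
        final long nanos;

        Sample(T item, String output, long nanos) {
            this.item = item;
            this.output = output;
            this.nanos = nanos;
        }
    }

    private final List<Sample<T>> samples = new ArrayList<>();

    // Runs the converter on one item, stores the elapsed time, returns the result.
    String convert(T item, Function<T, String> converter) {
        long start = System.nanoTime();
        String output = converter.apply(item);
        samples.add(new Sample<>(item, output, System.nanoTime() - start));
        return output;
    }

    // Returns a copy of the recorded samples, slowest conversion first.
    List<Sample<T>> slowestFirst() {
        List<Sample<T>> sorted = new ArrayList<>(samples);
        sorted.sort(Comparator.comparingLong((Sample<T> s) -> s.nanos).reversed());
        return sorted;
    }
}

final class ConversionTimerDemo {
    public static void main(String[] args) {
        ConversionTimer<String> timer = new ConversionTimer<>();
        // Assumption: a trivial stand-in for the real paragraph-to-HTML conversion.
        Function<String, String> fakeConverter = text -> "<p>" + text + "</p>";
        for (String paragraph : Arrays.asList("Heading", "Body text", "Footer")) {
            timer.convert(paragraph, fakeConverter);
        }
        for (ConversionTimer.Sample<String> s : timer.slowestFirst()) {
            System.out.println(s.nanos + " ns  " + s.item);
        }
    }
}
```

Logging the slowest few samples next to the paragraph text makes it easier to see whether the slow paragraphs have something in common.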
Your document contains a shape. Please check the attached image for details. image.png (11.4 KB)
You can remove the shape nodes from the document using the following code snippet and then call the paragraph.toString(SaveFormat.HTML) method to get the desired output.
Document doc = new Document(MyDir + "in.docx");
doc.getChildNodes(NodeType.SHAPE, true).clear();
You are calling the Node.Save method multiple times, which is why the application takes so long.
Indeed, the document contains one image, but nothing changes in terms of performance after removing all the shapes with doc.getChildNodes(NodeType.SHAPE, true).clear() before processing the docx with the DocumentVisitor implementation.
I think that I’ve found why it was taking so long to parse the document. The evaluation copy of the Aspose.Words for Java library inserts an image in the header of the Word document before processing it.
You can see this in the attached image: aspose-evaluation-copy-header.png (36.1 KB)
I’ve requested a temporary license and the execution time decreased dramatically from 15s to 600ms. I think it will be useful for others and helpful for your sales to specify this fact in the product documentation.
You can use the following code example to convert a paragraph to HTML.
Please note that some limitations apply in evaluation mode. For example, Aspose.Words injects an evaluation watermark at the top of the document, and the document’s content is truncated after a certain number of paragraphs during import or export.
It is nice to hear from you that your problem has been solved. Please let us know if you have any more queries.