Converting paragraphs to HTML one by one has a performance issue using Java

Hi,

Our team is now evaluating several tools to perform a task in our system and Aspose.Words is one of them.

The goal of this task is to extract HTML from a Word document one paragraph at a time.
Aspose.Words does the job, but performance when extracting HTML is poor: for a simple docx file (3 pages) the task takes around 15s.
Extracting plain text from the same docx file takes around 4s.

Here is the code in question:

private void addContentToCurrentPage(Paragraph paragraph) {
	// ignore text until first heading style is present
	if (fCurrentPage == null) {
		return;
	}

	String content;
	try {
		// HERE IS THE ISSUE: this call converts a single paragraph to HTML and is slow.
		content = paragraph.toString(SaveFormat.HTML);
	} catch (Exception ex) {
		System.out.println("Unable to parse paragraph. Message: " + ex.getMessage());
		content = "Import error placeholder";
	}

	DocumentModel.Fragment fragment = createHtmlFragment(content);
	fCurrentPage.addFragment(fragment);
}

After some debugging I’ve observed that the extraction takes longer for the paragraphs where processing raises the following warning:
“DrawingML is not supported in Html format and will be converted to shape.” I don’t understand where it comes from, because my document doesn’t contain any images.
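
To narrow down which paragraphs are slow, each conversion can be timed on its own. The following is only a minimal sketch in plain Java: `PerItemTimer`, `timeEach` and `slowestIndex` are hypothetical names, and the `converter` argument is a stand-in for whatever conversion is being measured, not part of the Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Times each conversion separately so that slow items stand out. (Hypothetical helper.) */
class PerItemTimer {

    /** Returns the elapsed time in nanoseconds for converting each item, in input order. */
    static <T> List<Long> timeEach(List<T> items, Function<T, String> converter) {
        List<Long> nanos = new ArrayList<>();
        for (T item : items) {
            long start = System.nanoTime();
            converter.apply(item); // the result is discarded; only the duration matters here
            nanos.add(System.nanoTime() - start);
        }
        return nanos;
    }

    /** Returns the index of the slowest item, or -1 if the list is empty. */
    static int slowestIndex(List<Long> nanos) {
        int slowest = -1;
        for (int i = 0; i < nanos.size(); i++) {
            if (slowest < 0 || nanos.get(i) > nanos.get(slowest)) {
                slowest = i;
            }
        }
        return slowest;
    }
}
```

In the method above, the converter would presumably wrap the paragraph.toString(SaveFormat.HTML) call together with its try/catch, so that the timings cover exactly the call marked as the issue.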

Is there a load option or any other way to make it run faster?

I’ve attached the java project containing the source code and the document in resources.
tc-aspose-evaluation.zip (126.4 KB)

Thank you,
Mihail

@mihail.manoli

We have tested the scenario using the latest version, Aspose.Words for Java 21.1, with the following code example and could not reproduce the issue. Please use Aspose.Words for Java 21.1.

Document doc = new Document(MyDir + "DemoDocumentSimple.docx");
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
System.out.println(paragraphs.getCount());
String content = "";
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    content = paragraph.toString(SaveFormat.HTML);
}

Please note that performance and memory usage depend on the complexity and size of the documents you are generating.

Moreover, the first call of “new Document()” triggers loading of all related classes and instantiation of system buffers. The static Aspose.Words resources (document styles, fonts, border arts, etc.) are loaded lazily, only when they are really needed, and after loading they are cached for the rest of the session. So the second call of “new Document()” will not cause class loading. If your JRE uses JIT, the behavior is more complex because of the several levels of byte-code compilation and optimization.
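
Because of this first-call overhead, a measurement is usually more representative if the task runs a few times untimed before the timed runs. The following is a rough sketch in plain Java, not an Aspose.Words facility: `WarmupTimer` and `averageMillis` are hypothetical names, and `task` stands in for loading and converting the document.

```java
/** Runs a task a few times untimed, then reports the average of the timed runs. (Hypothetical helper.) */
class WarmupTimer {

    /**
     * @param task       the work to measure (assumed stand-in for loading and converting a document)
     * @param warmupRuns untimed runs that give class loading, caching and JIT compilation a chance to settle
     * @param timedRuns  number of runs included in the average; must be positive
     * @return average duration of the timed runs in milliseconds
     */
    static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        // Warm-up phase: results and durations are ignored.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        // Timed phase: measure all runs together and divide by the run count.
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / timedRuns;
    }
}
```

A call such as averageMillis(task, 3, 10) runs the task 13 times and averages only the last 10. If the average stays high after warm-up, class loading alone probably does not explain the slowdown.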

@tahir.manzoor

I’m using the latest Aspose.Words for Java (21.1), and the above code is also very slow. IMHO, the document provided in the archive is not complex: it has 3 pages with several paragraphs.

In VisualVM I can see that most of the CPU time is spent in the com.aspose.words.zz3L.visitShapeStart() method, as shown in the attached screenshot: visualvm-cpu-time.png (92.9 KB)

My questions are:

  1. Why is the above method called if I don’t have any images in the document?
  2. Is there any way to bypass it or to reduce the execution time of the paragraph.toString(SaveFormat.HTML) method?

I don’t know if it is relevant, but the Java application is deployed on a WildFly 18 application server running in a Docker container based on Debian Stretch. Java version: 1.8

@mihail.manoli

Your document contains a shape. Please check the attached image for details. image.png (11.4 KB)

You can remove the shape nodes from the document using the following code snippet and then call the paragraph.toString(SaveFormat.HTML) method to get the desired output.

Document doc = new Document(MyDir + "in.docx");
doc.getChildNodes(NodeType.SHAPE, true).clear();

You are calling the Node.Save method multiple times, which is why the application takes so long.

@tahir.manzoor
Thank you for your reply.

Indeed, the document contains one image, but nothing changes in terms of performance after removing all the shapes with doc.getChildNodes(NodeType.SHAPE, true).clear() before processing the docx with the DocumentVisitor implementation.

Is there any way to extract HTML from a docx document one paragraph at a time without using a DocumentVisitor implementation?

I think I’ve found why it was taking so long to parse the document. The evaluation copy of the Aspose.Words for Java library inserts an image into the header of the Word document before processing it.
You can see this in the attached image: aspose-evaluation-copy-header.png (36.1 KB)

I’ve requested a temporary license, and the execution time dropped dramatically, from 15s to 600ms. I think stating this in the product documentation would be useful for others and helpful for your sales.

@mihail.manoli

You can use the following code example to convert a paragraph to HTML.

Please note that some limitations apply in evaluation mode. For example, Aspose.Words injects an evaluation watermark at the top of the document, and the document’s content is truncated after a certain number of paragraphs during import or export.

It is nice to hear from you that your problem has been solved. Please let us know if you have any more queries.

The title should be “… using Java”, not .NET

The code samples are in Java.

@mihail.manoli,

You are right; I have updated the title and replaced .NET with Java.