We are encountering a heap memory error in the following line:
document.save(byteArrayOutputStream, opts)
The document is only 229.4 kB. The method converts the document to HTML so that we can extract the HTML text of all content controls; after that, we use Jsoup to fetch the HTML text for each content control.
This method only works when we increase the JVM heap size (min and max) to 8GB, which seems excessive given the small document size.
Is there a more optimized or memory-efficient way to handle this case?
private static org.jsoup.nodes.Document extractHtmlDoc(Document document) throws Exception {
try {
HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
opts.setExportPageSetup(true);
opts.setExportListLabels(ExportListLabels.AS_INLINE_TEXT);
opts.setExportImagesAsBase64(true);
opts.setExportFontsAsBase64(true);
opts.setExportTocPageNumbers(true);
opts.setExportPageMargins(true);
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
document.save(byteArrayOutputStream, opts);
String html = byteArrayOutputStream.toString(StandardCharsets.UTF_8);
return Jsoup.parse(html);
} catch (Exception ex) {
LOGGER.error("Error while converting document to html due to exception ", ex);
return null;
}
}
@hariomgupta73
Thank you for sharing the details of the issue you are facing. To optimize memory usage while working with Aspose.Words for Java, you can follow these suggestions:
Optimizing Memory Usage
- Explicitly Close Streams and Resources: Ensure that any streams or resources used in your operations are explicitly closed after use. This can prevent memory leaks and unnecessary memory consumption.
- Modify HtmlSaveOptions: Review the HtmlSaveOptions settings you are using. Some options might cause higher memory usage than necessary. For example, setExportImagesAsBase64(true) and setExportFontsAsBase64(true) can increase memory usage, especially for documents with many images. Consider whether you need all these options enabled; disabling some might help.
- Process the Document Page by Page: Instead of converting the entire document to HTML at once, you can split it with Document.extractPages and convert one page at a time, so that each intermediate HTML string stays small. Note that a content control spanning a page break would be split between two pages.
Here’s a modified version of your method with these suggestions:
import com.aspose.words.*;
import org.jsoup.Jsoup;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
private static org.jsoup.nodes.Document extractHtmlDoc(Document document) throws Exception {
HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
opts.setExportPageSetup(true);
opts.setExportListLabels(ExportListLabels.AS_INLINE_TEXT);
opts.setExportImagesAsBase64(false); // Try disabling this if possible
opts.setExportFontsAsBase64(false); // Try disabling this if possible
opts.setExportTocPageNumbers(true);
opts.setExportPageMargins(true);
// Empty Jsoup document that collects the body of every converted page
org.jsoup.nodes.Document combined = org.jsoup.nodes.Document.createShell("");
try {
int pageCount = document.getPageCount();
for (int i = 0; i < pageCount; i++) {
Document page = document.extractPages(i, 1); // One page per iteration
try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
page.save(byteArrayOutputStream, opts);
String html = byteArrayOutputStream.toString(StandardCharsets.UTF_8);
combined.body().append(Jsoup.parse(html).body().html());
}
}
return combined;
} catch (Exception ex) {
LOGGER.error("Error while converting document to HTML due to exception ", ex);
return null;
}
}
Keep setExportImagesAsBase64 and setExportFontsAsBase64 set to true only if you really need images and fonts embedded in the HTML file, since these options encode them directly into the output.
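To give a rough idea of why the Base64 options matter, here is a minimal sketch that uses only the standard java.util.Base64 class and no Aspose API. The class name Base64Overhead and the 300,000-byte size are hypothetical and chosen purely for illustration; the sketch only shows that Base64 text is about one third larger than the binary data it encodes, before any further copies are made when the HTML is turned into a String and parsed.

```java
import java.util.Base64;

public class Base64Overhead {

    /** Returns the number of Base64 characters produced for rawBytes bytes of binary data. */
    public static int encodedLength(int rawBytes) {
        // The content of the array does not matter, only its length.
        return Base64.getEncoder().encodeToString(new byte[rawBytes]).length();
    }

    public static void main(String[] args) {
        int raw = 300_000; // hypothetical size of an embedded image or font, in bytes
        int encoded = encodedLength(raw);
        System.out.println(raw + " bytes -> " + encoded + " Base64 characters");
    }
}
```

Every embedded image and font pays this overhead inside the HTML string, so a document with many resources can produce HTML that is much larger than the source file.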
Additional Tips:
- Garbage Collection: You can request garbage collection before and after the conversion to free unused memory. Note that System.gc() is only a hint to the JVM and does not guarantee that a collection runs.
System.gc();
- JVM Tuning: Review and tune your JVM options:
java -Xms1024m -Xmx2048m -XX:+UseG1GC
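To check whether the heap settings took effect and how much memory the conversion really needs, you could log heap usage around the call. The following is only a sketch using the standard java.lang.management API; the class name HeapProbe is hypothetical, and the byte array merely stands in for the document.save(...) call from your method.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapProbe {

    /** Heap currently in use, in bytes, as reported by the JVM. */
    public static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx). */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        // Placeholder allocation; in the real method this would be the document.save(...) call.
        byte[] placeholder = new byte[16 * 1024 * 1024];
        long after = usedHeapBytes();
        System.out.printf("max heap: %d MB, used before: %d MB, used after: %d MB (placeholder: %d bytes)%n",
                maxHeapBytes() >> 20, before >> 20, after >> 20, placeholder.length);
    }
}
```

The numbers are approximate because the garbage collector may run between the two readings, but they are usually enough to see whether a single call is responsible for most of the heap growth.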
Please test these suggestions and see if they help optimize memory usage in your application. If the problem persists, we may need to look into more specific details of your document and application setup.
@hariomgupta73 Could you please attach the problematic input document here for testing? We will check the issue on our side and provide you more information.
@alexey.noskov
The document causing the issue is a client file, which we are unable to share. Could you please investigate the issue without the actual document?
Is there any alternative method you would recommend to handle our use case?
Previously, we used a generic approach to extract standard HTML text using a loop. However, this approach resulted in timeouts for large documents containing more than 700 content controls. To address this, we optimized it using the shared method above, which has worked for all documents so far—until this exception occurred.
@hariomgupta73 It looks like the problem occurs only with this specific document, so it is hard to tell what causes it without the document itself. You can try to anonymize the document by removing sensitive information while keeping just enough content to reproduce the problem.
In addition, please note that it is safe to attach documents in the forum; only you, as the topic starter, and Aspose staff can access the attachments.
I am attaching the document with the text modified.
public static void main(String[] args) throws Exception {
com.aspose.words.License license = new com.aspose.words.License();
license.setLicense("/home/hariom/Ideaproject/contract-authoring/common/src/main/resources/aspose-licence");
Document document = new Document("/home/hariom/Downloads/test data.docx");
System.out.println("dd");
org.jsoup.nodes.Document htmlDocument = extractHtmlDoc(document);
if (Boolean.TRUE.equals(document.hasRevisions())) {
document.acceptAllRevisions();
}
System.out.println("dd");
}
test data.docx (304.5 KB)
@hariomgupta73
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): WORDSJAVA-3142
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.