Hi,
When converting a Word document to HTML, numbered paragraphs in the Word document come out with poor formatting on the first line of each paragraph.
The first line is padded with non-breaking spaces, so the alignment is not clean.
Is there a way to improve the formatting? Maybe an option to format with CSS instead of non-breaking spaces?
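To make the request concrete, here is a minimal sketch of the difference I mean. It is not the actual Aspose output (that is in the attached mondoc.html); the class name, method names, label text, and CSS values are all made up for illustration.

```java
public class ListLabelSketch {

    // Hypothetical helper: pads the gap after the label with non-breaking
    // spaces, similar in spirit to what I see on the first line today.
    public static String withNbsp(String label, String text, int padCount) {
        StringBuilder sb = new StringBuilder("<p>").append(label);
        for (int i = 0; i < padCount; i++) {
            sb.append("&#xa0;");
        }
        return sb.append(text).append("</p>").toString();
    }

    // Hypothetical helper: hanging indent done in CSS. The label sits in a
    // fixed-width inline-block, so the text starts at the same offset on the
    // first line and on wrapped lines, whatever the label width is.
    public static String withCss(String label, String text, String indent) {
        return "<p style=\"margin-left:" + indent + "; text-indent:-" + indent + "\">"
                + "<span style=\"display:inline-block; width:" + indent
                + "; text-indent:0\">" + label + "</span>"
                + text + "</p>";
    }

    public static void main(String[] args) {
        System.out.println(withNbsp("1.", "First numbered paragraph", 4));
        System.out.println(withCss("1.", "First numbered paragraph", "2em"));
    }
}
```

The second form is the kind of markup I would like to get, because the alignment no longer depends on the width of a space character in the chosen font.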
Here are my files:
mondoc.docx is my input Word document
mondoc.html is my output HTML document
mondoc.zip (25.3 KB)
Here is the code used to perform the conversion:
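// LICENSE, DOCUMENT and HTML are path constants, and NoBomByteArrayOutputStream
// is a small helper class; all of them are defined elsewhere in my project.
public static void main(final String... strings) {
    try {
        final License license = new License();
        license.setLicense(LICENSE);
    } catch (final Exception e) {
        // Do not swallow this silently: without a valid license the library
        // runs in evaluation mode.
        System.err.println("Could not apply license: " + e.getMessage());
    }
    final String html;
    try {
        final LoadOptions lo = new LoadOptions();
        lo.setLoadFormat(LoadFormat.AUTO);
        lo.setEncoding(StandardCharsets.UTF_8);
        final Document doc = new Document(DOCUMENT, lo);
        // Strip content we do not want in the HTML output.
        doc.removeMacros();
        doc.removeSmartTags();
        doc.getChildNodes(NodeType.COMMENT, true).clear();
        doc.joinRunsWithSameFormatting();
        try (final NoBomByteArrayOutputStream bos = new NoBomByteArrayOutputStream()) {
            final HtmlSaveOptions saveOptions = new HtmlSaveOptions(SaveFormat.HTML);
            // This is the setting that produces the list labels in question.
            saveOptions.setExportListLabels(ExportListLabels.AS_INLINE_TEXT);
            saveOptions.setExportTocPageNumbers(false);
            saveOptions.setEncoding(StandardCharsets.UTF_8);
            saveOptions.setExportImagesAsBase64(true);
            doc.save(bos, saveOptions);
            html = bos.toUtf8String();
        }
    } catch (final Exception e) {
        // Keep the original exception as the cause so the failure can be diagnosed.
        throw new RuntimeException("invalid.corrupted", e);
    }
    try {
        // Write with an explicit charset instead of the platform default.
        Files.write(Paths.get(HTML), html.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
        e.printStackTrace();
    }
}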