List numbers duplicated when converting HTML to Word

Hi Team,

I am having issues converting HTML to DOCX. The converted document contains garbled special characters (“Party Aâ€). In addition, page numbering is incorrect, page breaks are removed, and margins are lost.

Code :

public static void main(String... args) throws Exception {
    com.aspose.words.License license = new com.aspose.words.License();
    license.setLicense("/home/saurabharora/Downloads/Aspose.Total.Product.Family.lic");

    String modifiedHTML = Files.readString(Paths.get("/home/saurabharora/Downloads/ckeditor.html"), StandardCharsets.UTF_8);
    modifiedHTML = modifiedHTML.replaceAll("[\uFEFF-\uFFFF]", "");
    FileUtil.writeToFile("/home/saurabharora/Downloads/ckeditor1.html", modifiedHTML.getBytes());
    Document htmlDoc = new Document("/home/saurabharora/Downloads/ckeditor1.html");
    htmlDoc.save("/home/saurabharora/Downloads/docfromckeditorhtml.docx", SaveFormat.DOCX);

}

htmltodocx.7z (40.4 KB)

Please help.
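A side note on the garbled characters: text like “Party Aâ€ is the typical result of UTF-8 bytes being decoded with a single-byte code page. In the code above, getBytes() without an argument uses the platform default charset, and the replaceAll call also strips a leading BOM (\uFEFF), which is one of the hints a reader can use to detect UTF-8. Whether either of these is the actual cause here is an assumption. The following stdlib-only sketch just reproduces the effect; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encodes text as UTF-8 and then decodes the bytes with a different charset,
    // which is how mojibake such as "â€" typically arises.
    static String decodeUtf8BytesAs(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    // Encodes and decodes with the same explicit charset, so the text survives intact.
    static String roundTripUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u201CParty A\u201D"; // curly-quoted sample text
        Charset cp1252 = Charset.forName("windows-1252");

        // Garbled: each curly quote becomes a multi-character sequence.
        System.out.println(decodeUtf8BytesAs(text, cp1252));

        // Intact: prints true.
        System.out.println(roundTripUtf8(text).equals(text));
    }
}
```

If the intermediate file is written at all, passing an explicit charset such as getBytes(StandardCharsets.UTF_8) at least removes the dependency on the platform default.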

@ashu_agrawal_sirionlabs_com The input HTML looks like HTML that was produced by Aspose.Words and then postprocessed. Specifically, the -aw-import:ignore markers have been removed from the spans that contain list item numbers:

<span style="font-size:10pt">(a)</span>

That is why these spans are not ignored by Aspose.Words and are imported as plain text, which duplicates the list labels.
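To check quickly whether an HTML file still carries these markers, something like the following could be used. It is a minimal sketch using only the Java standard library, and it assumes the marker appears inside the span's style attribute (for example style="-aw-import:ignore"); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImportIgnoreCheck {
    // Matches an opening <span> tag whose style attribute contains -aw-import:ignore.
    // Assumption: the style attribute value is double-quoted.
    private static final Pattern IGNORED_SPAN = Pattern.compile(
            "<span\\b[^>]*style\\s*=\\s*\"[^\"]*-aw-import\\s*:\\s*ignore[^\"]*\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Counts the spans that are still marked to be ignored on import.
    static int countIgnoredSpans(String html) {
        Matcher matcher = IGNORED_SPAN.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Label span as it might look right after export (illustrative markup).
        String exported = "<p><span style=\"-aw-import:ignore\">(a)</span><span>Text</span></p>";
        // Label span after the editor replaced the style, as in the attached file.
        String edited = "<p><span style=\"font-size:10pt\">(a)</span><span>Text</span></p>";

        System.out.println(countIgnoredSpans(exported)); // 1
        System.out.println(countIgnoredSpans(edited));   // 0
    }
}
```

A count of zero on the edited HTML, when the exported HTML had a nonzero count, suggests the editor dropped the markers and the labels will be imported as ordinary text.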

Thanks for the reply.

Can you please explain why page numbering is incorrect, page breaks are removed, and margins are lost?

@ashu_agrawal_sirionlabs_com Please note that Aspose.Words is designed to work with MS Word documents. The object models of HTML documents and MS Word documents are quite different, and it is not always possible to provide 100% fidelity when converting from one format to the other. In most cases Aspose.Words mimics MS Word behavior when working with HTML documents.
Also, to preserve as much as possible upon MS Word to HTML conversion, Aspose.Words writes roundtrip information into the output HTML document. But in your case it looks like part of the roundtrip information was removed from the HTML during postprocessing, which might cause problems when converting the HTML back to MS Word.
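One way to see what postprocessing removed is to compare the exported HTML with the edited HTML and list the roundtrip property names that no longer appear. The sketch below uses only the Java standard library and assumes that roundtrip information is written as CSS-style properties starting with -aw- (as with -aw-import:ignore above). The property -aw-example-prop in the sample is a made-up placeholder, not a real Aspose.Words property, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoundtripDiff {
    // Assumption: roundtrip property names start with "-aw-" and use lowercase
    // letters, digits, and hyphens.
    private static final Pattern AW_PROPERTY = Pattern.compile("-aw-[a-z0-9-]+");

    // Collects the distinct -aw-* property names found in the HTML text.
    static Set<String> awProperties(String html) {
        Set<String> names = new TreeSet<>();
        Matcher matcher = AW_PROPERTY.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group());
        }
        return names;
    }

    // Returns the property names present in the exported HTML but absent from the edited HTML.
    static Set<String> lostProperties(String exportedHtml, String editedHtml) {
        Set<String> lost = awProperties(exportedHtml);
        lost.removeAll(awProperties(editedHtml));
        return lost;
    }

    public static void main(String[] args) {
        // Illustrative markup only; -aw-example-prop is a placeholder name.
        String exported = "<p style=\"-aw-example-prop:1\">"
                + "<span style=\"-aw-import:ignore\">(a)</span></p>";
        String edited = "<p><span style=\"font-size:10pt\">(a)</span></p>";

        // Prints the names that the edit step dropped.
        System.out.println(lostProperties(exported, edited));
    }
}
```

This only reports which kinds of markers disappeared entirely; it does not show where they were, so it is a first diagnostic rather than a repair.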

@alexey.noskov, thanks for the reply. I am now preserving the roundtrip information. The numbering is corrected, and so is the invalid character issue. But the section break, page break, and margin issues still exist. Can you please help with this?

Updated code :

public static void main(String... args) throws Exception {
    com.aspose.words.License license = new com.aspose.words.License();
    license.setLicense("/home/saurabharora/Downloads/Aspose.Total.Product.Family.lic");
    String modifiedHTML = Files.readString(Paths.get("/home/saurabharora/Downloads/ckeditor.html"), StandardCharsets.UTF_8);
    modifiedHTML = modifiedHTML.replaceAll("[\uFEFF-\uFFFF]", "");
        
    Document document = new Document();
    DocumentBuilder documentBuilder = new DocumentBuilder(document);
    documentBuilder.insertHtml(modifiedHTML);
    document.updateFields();
    document.updatePageLayout();
    document.save("/home/saurabharora/Downloads/docfromckeditorhtml12.docx");
}

htmltodoc.7z (40.1 KB)

Please help.

Hi Team,

Any update?

@ashu_agrawal_sirionlabs_com Could you please attach the source MS Word document the attached HTML file was produced from? I still see that the HTML has been postprocessed.
Please note that it is not always possible to provide 100% fidelity after DOCX->HTML->DOCX roundtrip due to significant differences in HTML and MS Word document object models.

@alexey.noskov, thanks for the reply. Let me walk you through our flow. We have a document (original.docx), and we convert it to HTML using the following code:

HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
opts.setExportPageSetup(true);
opts.setExportListLabels(ExportListLabels.BY_HTML_TAGS);
opts.setExportImagesAsBase64(true);
opts.setExportFontsAsBase64(true);
opts.setExportTocPageNumbers(true);
document.save("/home/saurabharora/Downloads/htmlfromoriginalDoc.html", opts);

We then open this HTML in the CKEditor 4 (ck4) editor, make some edits, take the updated HTML from the editor, and try to convert it back to a Word document. The issues appear during this conversion back to Word. I am attaching the original document, the HTML generated from it, and the edited HTML.

html_issues.7z (48.8 KB)

Please help.

@ashu_agrawal_sirionlabs_com Thank you for the additional information. But as I have already mentioned, it is not always possible to provide 100% fidelity after a DOCX->HTML->DOCX roundtrip due to significant differences between the HTML and MS Word document object models.