Extract HTML from Rich Text Content Control & Nested Date Plain Text SDT in Word Document using Java

Hi,

I am trying to extract HTML from a rich text content control (StructuredDocumentTag). I use the following snippet to extract it:

std.toString(SaveFormat.HTML)

But the returned HTML does not contain any information about the nested content controls (SDT types such as date and plain text).

I want to extract the complete HTML of the rich text content control, including its nested controls.

We are really stuck on this, as it impacts multiple use cases.

Thanks

@saurabh.arora,

Please ZIP and upload here the sample input Word document from which you want to extract the HTML representation. We will then investigate the scenario on our end and provide you with more information.

Hi,

Thanks for the reply.

Please find the attached document (test.docx). I have also attached the complete HTML output file (test.html).

std_issue.zip (27.6 KB)

The issue is that the HTML extracted from the content control (rich text SDT) does not contain the nested SDT tags, while the complete HTML saved from the whole document has the proper tags. I want to get that same information in the HTML extracted from the rich text SDT.

Here is the code used:

public static void main(String... args) throws Exception {
    Document document = new Document("/home/sauravarora/Downloads/test.docx");

    // Print the HTML of every rich text content control in the document
    for (Object st : document.getChildNodes(NodeType.STRUCTURED_DOCUMENT_TAG, true)) {
        StructuredDocumentTag std = (StructuredDocumentTag) st;
        if (std.getSdtType() == SdtType.RICH_TEXT) {
            System.out.println(std.toString(SaveFormat.HTML));
        }
    }

    // Save the whole document as HTML for comparison
    document.save("/home/sauravarora/test.html", SaveFormat.HTML);
}

Thanks

@saurabh.arora,

The following Aspose.Words for Java code extracts the HTML strings of all block-level rich text content controls, including the HTML of any nested content controls. Hope this helps.

Document doc = new Document("E:\\std_issue\\test.docx");

// Save options are the same for every control, so create them once
HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
opts.setPrettyFormat(true);

// Deep traversal: visits every content control, including nested ones
for (StructuredDocumentTag sdt : (Iterable<StructuredDocumentTag>) doc.getChildNodes(NodeType.STRUCTURED_DOCUMENT_TAG, true)) {
    // Keep only block-level rich text controls
    if (sdt.getLevel() == MarkupLevel.BLOCK && sdt.getSdtType() == SdtType.RICH_TEXT) {
        System.out.println(sdt.toString(opts));

        // Visual separator between controls
        System.out.println("////////////////////");
        System.out.println("////////////////////");
    }
}
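
For anyone wondering why the loop filters on level and type: a deep traversal also visits the nested controls themselves, so the filter keeps only the outer block-level rich text controls, and the nested ones show up inside their parent's output. Below is a rough, library-free sketch of that idea. Note that ToyControl and SdtFilterSketch are hypothetical stand-ins made up for illustration; they are not Aspose.Words classes, and the string levels and types only mimic MarkupLevel and SdtType.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a content control; NOT an Aspose.Words class.
class ToyControl {
    final String type;   // e.g. "RICH_TEXT", "DATE", "PLAIN_TEXT"
    final String level;  // e.g. "BLOCK", "INLINE"
    final String text;
    final List<ToyControl> children = new ArrayList<>();

    ToyControl(String type, String level, String text) {
        this.type = type;
        this.level = level;
        this.text = text;
    }

    ToyControl add(ToyControl child) {
        children.add(child);
        return this;
    }

    // Renders this control together with everything nested inside it.
    String render() {
        StringBuilder sb = new StringBuilder("<div data-type=\"" + type + "\">" + text);
        for (ToyControl c : children) {
            sb.append(c.render());
        }
        return sb.append("</div>").toString();
    }

    // Deep traversal: adds this control and all of its descendants, in document order.
    void collectDeep(List<ToyControl> out) {
        out.add(this);
        for (ToyControl c : children) {
            c.collectDeep(out);
        }
    }
}

class SdtFilterSketch {
    // Toy tree: one block-level rich text control holding two inline controls.
    static ToyControl sampleTree() {
        return new ToyControl("RICH_TEXT", "BLOCK", "Intro ")
                .add(new ToyControl("DATE", "INLINE", "2020-01-01"))
                .add(new ToyControl("PLAIN_TEXT", "INLINE", "note"));
    }

    // Renders only the block-level rich text controls found by the deep traversal.
    static List<String> renderBlockRichText(ToyControl root) {
        List<ToyControl> all = new ArrayList<>();
        root.collectDeep(all);
        List<String> html = new ArrayList<>();
        for (ToyControl c : all) {
            if (c.level.equals("BLOCK") && c.type.equals("RICH_TEXT")) {
                html.add(c.render());
            }
        }
        return html;
    }

    public static void main(String[] args) {
        for (String fragment : renderBlockRichText(sampleTree())) {
            System.out.println(fragment);
        }
    }
}
```

In this toy tree the traversal visits three controls, but only the outer one passes the filter, and its rendered fragment already contains the nested date and plain text content.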

Forgot to reply. It worked sir.

Thanks a ton!!