Issue in HTML extraction from Word document with respect to structured document tags (content controls)

saurabh.arora · September 29, 2022, 8:13pm

Hi Team,

I am extracting HTML from a Word document. The output HTML does not contain complete information about content controls: it does not specify the type of each content control. I want to write this HTML into another document and have the content control nodes preserved.

Code used:

import com.aspose.words.Document;
import com.aspose.words.HtmlSaveOptions;
import com.aspose.words.License;
import com.aspose.words.SaveFormat;

public static void main(String... args) throws Exception {
    // Apply the license before loading any document.
    License license = new License();
    license.setLicense("/home/saurabharora/aspose-licence.xml");

    Document document = new Document("/home/saurabharora/Downloads/CL06906-null-2022-09-28.docx");

    // Export the whole document to an indented HTML string.
    // Alternative without options: document.toString(SaveFormat.HTML)
    HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
    opts.setPrettyFormat(true);
    String docHtml = document.toString(opts);
    System.out.println(docHtml);
}

Document:
test_html.zip (15.5 KB)

HTML output:

<html>
	<head>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
		<meta http-equiv="Content-Style-Type" content="text/css" />
		<meta name="generator" content="Aspose.Words for Java 22.2.0" />
		<title>
		</title>
	</head>
	<body style="font-family:'Times New Roman'; font-size:12pt">
		<div>
			<p style="margin-top:0pt; margin-bottom:0pt; font-size:13pt">
				<span style="font-weight:bold; -aw-import:ignore">&#xa0;</span>
			</p>
			<h2 style="margin-top:0pt; margin-left:36pt; margin-bottom:14.95pt; text-indent:-18pt">
				<span style="font-style:italic; -aw-import:ignore">&#xa0;</span>
			</h2>
			<p style="margin-top:12pt; margin-bottom:12pt">
				<span style="font-family:Arial">In computer programming, characters are pieced together to form strings, which are data types that are often implemented into bytes of data that can be read by computers. With online activity and the</span><span style="font-family:Arial">&#xa0;</span><span style="-aw-sdt-tag:'BASIC__101__15045__206__206'; -aw-sdt-title:'ankush Currency FieldCDR'"><span style="font-family:Arial; background-color:#00ffff">ankush Currency FieldCDR</span></span><span style="font-family:Arial">&#xa0;</span><span style="-aw-sdt-tag:'BASIC__101__15548__206__206'; -aw-sdt-title:'AnkushNumerictag'"><span style="background-color:#00ffff">AnkushNumerictag</span></span><span>&#xa0;</span><span style="-aw-sdt-tag:'BASIC__101__15577__206__206'; -aw-sdt-title:'Ankush TestField'"><span style="background-color:#00ffff">Ankush TestField</span></span><span>&#xa0;</span><span style="-aw-sdt-tag:'BASIC__101__14886__206__206'; -aw-sdt-title:'Akki Text Field test'"><span style="background-color:#00ffff">Akki Text Field test</span></span><span>&#xa0;</span>
			</p>
			<p style="margin-top:12pt; margin-bottom:12pt">
				<br /><span style="-aw-import:ignore">&#xa0;</span>
			</p>
			<p style="margin-top:0pt; margin-bottom:6pt">
				<span style="-aw-import:ignore">&#xa0;</span>
			</p>
		</div>
	</body>
</html>

Please help.

mlyra · September 30, 2022, 1:05am

@saurabh.arora

Could you please tell us which version of Aspose.Words you are using?

alexey.noskov · September 30, 2022, 6:26am

@saurabh.arora Aspose.Words preserves Structured Document Tags in HTML, but since the HTML and MS Word document models are very different, it is difficult and sometimes impossible to get 100% fidelity after a Word->HTML->Word roundtrip. In this particular case, Aspose.Words preserves SDTs using Aspose.Words-specific -aw-sdt-XXX properties inside the style attribute:

<span style="-aw-sdt-tag:'BASIC__101__15045__206__206'; -aw-sdt-title:'ankush Currency FieldCDR'">

If I understand your scenario properly, you need to copy content from one document to another and you use HTML as an intermediate content holder. If so, you can consider using the whole document as the content holder and calling DocumentBuilder.insertDocument to insert content from one document into another. In this case the document is never converted to a non-native format, and all MS Word features used in the source document are preserved.

saurabh.arora · September 30, 2022, 12:56pm

We are using Aspose.Words 22.2.

saurabh.arora · September 30, 2022, 1:09pm

Thanks for the reply.

Is there no way to retain the content control type information when we convert a document to HTML? When I insert the HTML back into a new document, all tags are converted to plain text tags.

I tried the other approach of inserting the complete document into a new document, and it works. However, our framework is built to work on HTML, so I was wondering whether there is a workaround for the first approach.

Thanks

alexey.noskov · September 30, 2022, 1:25pm

@saurabh.arora Unfortunately, it is quite difficult to provide a workaround that preserves all MS Word features in HTML. Was HTML selected as the intermediate format because it can be stored as a string, or are there other reasons, for example editing the HTML in a WYSIWYG editor? If storing content is the only reason, you can use the FlatOpc format as the intermediate format. It is the same as DOCX, but in a flat XML representation.

saurabh.arora · January 19, 2023, 8:11pm

Hi Team,

Is there any update on this? Can we get the content control type in the extracted HTML in the latest version?

alexey.noskov · January 20, 2023, 6:36am

@saurabh.arora I am afraid Aspose.Words behavior regarding preserving SDTs in HTML has not changed since version 22.2. Aspose.Words still uses the aw-sdt-XXX attributes to preserve basic SDT information for roundtrip.
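
Since the exported HTML carries only the tag and title of each content control, a framework that works on HTML could at least read those two values back and map them to control types it tracks on its own side. The following is a minimal sketch in plain Java, not part of Aspose.Words. The class name AwSdtScanner and the type SdtInfo are hypothetical, and the pattern assumes the properties appear in the same order as in the sample output above (-aw-sdt-tag first, then -aw-sdt-title) and that the values contain no single quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the tag/title pairs that the HTML export
// writes as -aw-sdt-tag and -aw-sdt-title inside style attributes.
public class AwSdtScanner {

    /** Tag and title of one content control as found in the HTML. */
    public static final class SdtInfo {
        public final String tag;
        public final String title;

        SdtInfo(String tag, String title) {
            this.tag = tag;
            this.title = title;
        }
    }

    // Assumption: tag precedes title and neither value contains a single quote,
    // which matches the sample output in this thread.
    private static final Pattern SDT_STYLE = Pattern.compile(
            "-aw-sdt-tag:'([^']*)';\\s*-aw-sdt-title:'([^']*)'");

    /** Returns every tag/title pair found in the given HTML, in document order. */
    public static List<SdtInfo> scan(String html) {
        List<SdtInfo> result = new ArrayList<>();
        Matcher m = SDT_STYLE.matcher(html);
        while (m.find()) {
            result.add(new SdtInfo(m.group(1), m.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<span style=\"-aw-sdt-tag:'BASIC__101__15045__206__206'; "
                + "-aw-sdt-title:'ankush Currency FieldCDR'\">"
                + "<span>ankush Currency FieldCDR</span></span>";
        for (SdtInfo info : scan(html)) {
            System.out.println(info.tag + " -> " + info.title);
        }
    }
}
```

This recovers only what the HTML contains. The content control type itself is not in the output, so it would have to come from a separate lookup keyed by tag, or from one of the document-based approaches suggested earlier in the thread.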