Free Support Forum - aspose.com

Structured Document Tags Lost During Round Trip to HTML

Hey guys,

We’re experiencing two main issues (plus a third oddity) with structured document tags (SDTs, i.e. content controls) in Aspose.Words 15.5 for .NET.

We’re attempting to open a docx file and save it as html (using the ExportRoundtripInformation property), as described in these two articles:

We’re using almost exactly the code from the article. I’ve only added some Console.WriteLine calls because it’s so slow, and they let me see what’s going on.

1.) The first issue is that not all of the content controls get written out to the HTML. Many do, but the ones that contain only a table are only partially written out: they don’t carry the proper style(s), the opening tag just becomes a child of the first cell of the table, the title information is missing altogether, and there is no closing tag, so it’s hard to make use of them.

2.) And even more problematic: when you read the HTML back in, the content controls are completely stripped out, even if you save it back to a docx file. Viewing the document in Word confirms they have been removed. Zero SDTs in the document output!

This is pretty problematic for us!

A third oddity I’ve noticed:

3.) The page count and file size of the round-tripped docx are drastically different from the input. The round-tripped docx is 34 pages longer than the input document (202 vs. 168 pages) and about 5 MB smaller.

Here’s the basic code I’m using:

using System;
using System.IO;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Saving;

var inputDocxFile = @"C:\path\to\NEM_Master.docx";

var folder = Path.GetDirectoryName(inputDocxFile);

var fileName = Path.GetFileNameWithoutExtension(inputDocxFile);

var outputHtmlFile = Path.Combine(folder, fileName + "-roundtrip.html");

var outputDocxFile = Path.Combine(folder, fileName + "-roundtrip.docx");

//Load the document into Aspose.Words.

Console.WriteLine("Opening " + Path.GetFileName(inputDocxFile) + " . . .");

var doc = new Document(inputDocxFile);

Console.WriteLine(" Content Control Count: " +
    doc.Descendants().Where(node => node.NodeType == NodeType.StructuredDocumentTag).Count());

Console.WriteLine(" Page Count: " + doc.PageCount);

HtmlSaveOptions options = new HtmlSaveOptions();

// Whether to write the round-trip information when saving to HTML, MHTML or EPUB.
// Default value is true for HTML and false for MHTML and EPUB.

options.ExportRoundtripInformation = true;

Console.WriteLine(" Saving " + Path.GetFileName(outputHtmlFile) + " . . .");

doc.Save(outputHtmlFile, options);

// Reload from HTML

Console.WriteLine();

Console.WriteLine("Opening " + Path.GetFileName(outputHtmlFile) + " . . .");

doc = new Document(outputHtmlFile);

Console.WriteLine(" Content Control Count: " +
    doc.Descendants().Where(node => node.NodeType == NodeType.StructuredDocumentTag).Count());

Console.WriteLine(" Page Count: " + doc.PageCount);

// Save the document in Docx file format

Console.WriteLine(" Saving " + Path.GetFileName(outputDocxFile) + " . . .");

doc.Save(outputDocxFile, SaveFormat.Docx);

Console.WriteLine();

Console.WriteLine("Opening " + Path.GetFileName(outputDocxFile) + " . . .");

doc = new Document(outputDocxFile);

Console.WriteLine(" Content Control Count: " +
    doc.Descendants().Where(node => node.NodeType == NodeType.StructuredDocumentTag).Count());

Console.WriteLine(" Page Count: " + doc.PageCount);

Console.WriteLine();

And here’s Console Output from that (I’ve also verified by hand that this is accurate):

Opening NEM_Master.docx . . .

Content Control Count: 2496

Page Count: 168

Saving NEM_Master-roundtrip.html . . .

Opening NEM_Master-roundtrip.html . . .

Content Control Count: 0

Page Count: 191

Saving NEM_Master-roundtrip.docx . . .

Opening NEM_Master-roundtrip.docx . . .

Content Control Count: 0

Page Count: 202

Press any key to continue . . .
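
The counts above can also be cross-checked without Aspose.Words. The following is a minimal sketch, not part of the original code: it counts the `w:sdt` elements (the WordprocessingML representation of content controls) directly in the docx package. The class and method names are hypothetical. It assumes the main document part is stored at `word/document.xml`, which is where Word normally writes it, and it ignores headers, footers and other parts, so it can report a lower number than Aspose.Words if content controls live there. On .NET Framework, `ZipFile` additionally needs a reference to the System.IO.Compression.FileSystem assembly.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for an independent sanity check of the SDT count.
static class SdtCounter
{
    // WordprocessingML main namespace; content controls are <w:sdt> elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts all <w:sdt> elements, including nested ones, in one XML part.
    public static int CountInXml(string xml)
    {
        return XDocument.Parse(xml).Descendants(W + "sdt").Count();
    }

    // Counts <w:sdt> elements in the main document part of a .docx package.
    // Assumption: the main part is located at word/document.xml.
    public static int CountInDocx(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException(
                    "word/document.xml not found in " + docxPath);

            using (var reader = new StreamReader(entry.Open()))
                return CountInXml(reader.ReadToEnd());
        }
    }
}
```

Calling `SdtCounter.CountInDocx(inputDocxFile)` and `SdtCounter.CountInDocx(outputDocxFile)` with the paths from the code above gives a second opinion on whether the content controls are really gone from the saved file, rather than only missing from the Aspose.Words object model.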

Reading the articles I linked above (especially the first one, as it calls out Structured Document Tags directly), it seems both items 1 and 2 should be working. Is that correct? Is this a bug in Aspose.Words?

Thank you,

–Mikey

Hi Mikey,

Thanks for your inquiry.

MikeJ-1:

1.) The first issue is that not all of the content controls get written out to the HTML. Many do, but the ones that contain only a table are only partially written out: they don’t carry the proper style(s), the opening tag just becomes a child of the first cell of the table, the title information is missing altogether, and there is no closing tag, so it’s hard to make use of them.

Could you please attach your input Word document here for testing? I will investigate the issue on my side and provide you with more information.

MikeJ-1:

2.) And even more problematic: when you read the HTML back in, the content controls are completely stripped out, even if you save it back to a docx file. Viewing the document in Word confirms they have been removed. Zero SDTs in the document output!

I have tested this scenario with a simple document and have managed to reproduce the same issue on my side. I have logged this problem in our issue tracking system as WORDSNET-12151 so that it can be fixed. I have linked this forum thread to the issue, and you will be notified here once it is resolved. We apologize for the inconvenience.

MikeJ-1:

A third oddity I’ve noticed:
3.) The page count and file size of the round-tripped docx are drastically different from the input. The round-tripped docx appears to be about 34 pages longer than the input document (and about 5 MB smaller).

Please note that Aspose.Words mimics the behavior of MS Word here. If you convert your document to HTML using MS Word, you will get the same output.

Moreover, MS Word documents and HTML documents are two different document types, so the page counts of the Docx and the HTML may differ. Could you please attach your input Word document here for testing? I will investigate the issue on my side and provide you with more information.