Section Break generates invalid HTML that crashes Outlook

With version 8.0.0, when saving to HTML format, a section break is generated like this:

</div><br style="clear:both; mso-break-type:section-break; page-break-before:auto"><div>

Outlook spins into an infinite loop when it encounters an HTML email with this tag. There are two problems:

  1. The <br> tag in that position does not pass XHTML validation.
  2. The mso-break-type property does not pass CSS validation.

Fix #1: Wrap the tag with a <p> with no margin, to keep whitespace the same while also passing XHTML validation. Like this:

<p style="margin: 0pt"><br style="clear: both; mso-break-type: section-break; page-break-before: auto" /></p>

Fix #2: Add a setting to control whether mso-* styles are output in HTML. If you don’t need the document to be round-tripped, then eliminating the mso-* styles will produce CSS compliant output.

Hi

Thanks for your request. Aspose.Words writes section breaks the same way MS Word does. I think we should make writing section breaks to HTML optional, so that you can simply disable them. I have linked your request to the appropriate issue. You will be notified as soon as such an option is available.
For now, I can suggest a simple workaround: copy all content of your document into a single section. Please see the following code:

Document doc = new Document(@"in.doc");
// Append the content of every other section to the first section.
while (doc.Sections.Count > 1)
{
    doc.FirstSection.AppendContent((Section) doc.FirstSection.NextSibling);
    doc.FirstSection.NextSibling.Remove();
}
doc.SaveOptions.ExportPrettyFormat = true;
doc.Save(@"out.html");

Hope this helps.
Best regards.

Hello!

Thank you for your thoughtful post and your patience. I’m Viktor, the developer responsible for HTML export. Here are some considerations regarding the issue with section breaks.

We usually pay attention to XHTML compliance and other compatibility issues. This is especially important when exporting to the IDPF EPUB format. Please note that HTML output with SaveOptions.HtmlExportXhtmlTransitional is transitional, not strict, XHTML. Let’s consider the problems we face and think about possible solutions.

There are four types of breaks we need to export:

  • line break;
  • page break;
  • section break;
  • column break.

The first two are fully compliant in both Aspose.Words and Microsoft Word HTML export. A line break is simply <br /> (no attributes) inside a paragraph (<p>). This is perfectly legal even in strict XHTML and should be understood by all user agents. A page break is also fine since it occurs inside <p> and doesn’t have any Microsoft specific attributes. In rare cases people would like to replace line breaks with paragraph starts or suppress page breaks, but these are very minor requirements. Let’s leave them as they are output now.

Section breaks are not so easy. Sections are represented in the output HTML as <div> elements, and breaks have to occur at the same outer level, as children of <body>. Microsoft Word produces much the same output as Aspose.Words but also encloses the <br> in a <span>. Neither choice is strict XHTML compliant. As you noted, we could emit a dummy paragraph (<p>) to enclose the break and remain compliant. But even with explicit zero margins, a paragraph is not a good fit for this purpose, if only because it is a block-level element. We met the same problem in IDPF EPUB export, which requires strict XHTML, so we needed a way to output section breaks that is still compliant. We removed the <br> element altogether and moved its CSS properties to the subsequent <div> element. In theory this should be treated as a break, but Microsoft Word up to version 2007 doesn’t recognize such breaks:

…
<div>
<p style="margin:0pt"><span style="font-family:serif; font-size:12pt">Section 1</span></p>
</div>
<div style="page-break-before:always; clear:both; mso-break-type:section-break">
<p style="margin:0pt"><span style="font-family:serif; font-size:12pt">Section 2</span></p>
</div>
…

This construction can be considered for both export and import. But since Aspose.Words is primarily oriented toward processing Microsoft Word documents, we need to think about bidirectional roundtrip with that application. Of course we cannot roundtrip every feature, but such things are always in focus.

Another issue you pointed out is the use of Microsoft Word specific properties such as mso-break-type and mso-column-break-before to export section breaks and column breaks. Our general policy is to avoid Microsoft “magic” wherever possible. Most people don’t like it in their HTML. In some cases we cannot roundtrip features without a huge “magic” overhead, and we simply sacrifice roundtrip for HTML purity. In other cases we provide options, for instance SaveOptions.HtmlExportDocumentProperties, to enable some “magic”. By default all these options are disabled. I agree with you that Microsoft specific “magic” should be optional in HTML and disabled by default. But section breaks and column breaks probably need more than just Boolean options, which could bring some complexity.

Strictly speaking, non-standard CSS properties should not be a problem. The CSS specification states that readers (user agents) must ignore any unknown constructs as long as their general syntax is valid. In particular, any unknown property must be ignored. This is intended to allow extensions, including forward compatibility. Please share with us how you check CSS style sheets and why you need this check.

How can we output section breaks according to our discussion? Here are the choices:

  1. As we do now, or as Microsoft Word does: non-compliant but recognized by Microsoft Word.
  2. As we do in IDPF EPUB export, the other way around: strict XHTML but not recognized by Microsoft Word.
  3. Filtering “magic”: output the break in any form but without “mso-break-type:section-break”. A section break that starts a new page then becomes a simple page break. The sections themselves are preserved.
  4. As page breaks for section breaks that start a new page, ignoring the others. In this case it’s reasonable to join all sections in the output. (The question is what to do with page setup if it is also output via SaveOptions.HtmlExportPageSetup.)
  5. Always ignore. Joining sections is also reasonable here, but questionable.

For column breaks we can limit ourselves to one Boolean option. Currently column breaks are output, but neither the section properties nor the page setup say how many columns a particular section should have. The Boolean option should probably control both the column count and all column breaks. If page setup is also requested, it should optionally include the following property:

@page Section1
{
…
mso-columns:2 even 36.0pt;
…
}
div.Section1
{page:Section1;}

This is not implemented. So any column break currently has page break semantics. We’ll consider full support in the future.

As you can see, there are quite a few degrees of freedom and unanswered questions. That’s why we haven’t done this parameterization yet. In the end we would likely have to provide our own roundtrip for every possible output, which would bring unjustified complexity. It is easier to have only one algorithm for export and import. That is how things are done in Microsoft Word: there is only one export option, which filters out all “magic”.

Please share any ideas on how we can deal with breaks. Your feedback is very important for us. Thank you in advance!

Regards,

Thank you for the detailed response.

New information… The Outlook and Word hang occurs when a non-trivial HTML document contains a section-break, the <body> fragment of that HTML is extracted and then embedded into a table cell within another non-trivial HTML document, and then that resulting HTML is opened by Word or sent in an email and viewed by Outlook, which then hangs at 100% CPU usage.

Using Microsoft Word alone, I cannot create a new document with a section-break within a table, so I assume that is an unsupported scenario and is ultimately why Word/Outlook fail when attempting to render HTML that contains a section-break within a table.

I tried removing the <br ... section-break...> tag and moving its style attribute to the following <div> tag, but Word did not recognize the section break in that form, which is why I proposed wrapping the <br> with a <p>, which Word does recognize.

For now, I have added some post-processing to strip out the section-break style and wrap the corresponding element with a <p> element. That is working fine, but is somewhat fragile since it relies on Regex pattern matching. The next step is to redesign the HTML that this HTML fragment is being embedded within. Ultimately, it would be nice to have some more control over Microsoft-specific output styles.
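A minimal sketch of what such post-processing could look like, not the code actually used here. It assumes the break is a <br> whose double-quoted style attribute contains mso-break-type:section-break; the class and method names (SectionBreakFixer, Fix) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SectionBreakFixer
{
    // Assumption: the exporter writes the break as <br style="... mso-break-type:section-break ...">.
    private static readonly Regex SectionBreak = new Regex(
        @"<br\b[^>]*\bstyle\s*=\s*""(?<style>[^""]*mso-break-type\s*:\s*section-break[^""]*)""[^>]*>",
        RegexOptions.IgnoreCase);

    // One mso-* declaration, together with its trailing semicolon if present.
    private static readonly Regex MsoDeclaration = new Regex(
        @"\s*mso-[\w-]+\s*:[^;]*;?",
        RegexOptions.IgnoreCase);

    // Drops the mso-* declarations from each section-break <br> and wraps it in a zero-margin <p>.
    public static string Fix(string html)
    {
        return SectionBreak.Replace(html, match =>
        {
            string style = MsoDeclaration.Replace(match.Groups["style"].Value, "").Trim();
            return "<p style=\"margin:0pt\"><br style=\"" + style + "\" /></p>";
        });
    }
}
```

Calling SectionBreakFixer.Fix on the exported HTML string before embedding it into the outer template is the intended use. A pattern like this only works while the exporter keeps writing the break in this exact shape, which is the fragility mentioned above.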

Hello again!
Thank you for your feedback. Section breaks inside tables make no sense, but they do occur in some weird documents. Of course you cannot insert them with Microsoft Word, since it never produces invalid documents. But you can get them from third-party applications or sources, or create them manually. If Outlook or Word hangs when opening some documents, we cannot help with that. It should be either worked around or reported to Microsoft.
Aspose.Words tries to open invalid documents, either silently fixing minor problems or throwing exceptions. For instance, if a section break occurs inside a table, it correctly throws the following exception:
Aspose.Words.FileCorruptedException: The document appears to be corrupted and cannot be loaded. ---> System.InvalidOperationException: Cannot insert the requested break inside a table.
If you come across any documents that hang Aspose.Words, please report them to us. Any hang is considered a defect and should be addressed.
Moving the properties specific to a section break to the corresponding <div> element won’t work in Outlook or Word, as I already wrote. If wrapping breaks in paragraphs works for you, that’s okay. The only risk with post-processing is that future versions of Aspose.Words might output HTML differently, so the assumptions encoded in the regular expression would no longer hold. To cover this case, it’s a good idea to add some unit tests to your project and run them after every upgrade.
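As a rough illustration of such a check, the sketch below verifies that every section-break marker in the exported HTML still sits on a <br> element, which is the shape a regex-based fix would expect. The class and method names are hypothetical, and the check would normally be called from whatever unit test framework the project already uses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExportAssumptionCheck
{
    // A <br> element that carries the section-break marker somewhere in its attributes.
    private static readonly Regex BreakOnBr = new Regex(
        @"<br\b[^>]*mso-break-type\s*:\s*section-break[^>]*>",
        RegexOptions.IgnoreCase);

    // The marker itself, wherever it appears.
    private static readonly Regex AnyMarker = new Regex(
        @"mso-break-type\s*:\s*section-break",
        RegexOptions.IgnoreCase);

    // True when each marker is found on a <br>; false when some marker has moved
    // to another element (for example a <div>), so the post-processing would miss it.
    public static bool BreaksHaveExpectedShape(string html)
    {
        return AnyMarker.Matches(html).Count == BreakOnBr.Matches(html).Count;
    }
}
```

Running this against HTML exported from a small multi-section fixture document after each upgrade would flag a change in output format before it reaches production.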
Regards,

I think something like option #3 that you described on 2009-11-25 would be ideal. Provide a single Boolean option (HtmlExportStrictCSS or HtmlExportMicrosoftStyles) that is the master controller of whether any Microsoft styles are output, so that you have the choice of 1) generating HTML that can round-trip back to Word, or 2) generating HTML that will pass the W3C CSS Validation Service without warnings or errors.

The W3C CSS Validation Service reports the mso-break-type style as an error. Since HTML renderers ignore styles they don’t understand, it is tempting to think of this as simply a warning, but my issue is that the HTML output produced by Aspose is later inserted into an HTML template and the non-standard style in an unsupported context causes Word/Outlook to hang. When the non-standard style is removed, Word/Outlook render the document successfully. It is difficult to know whether any other non-standard styles have problems in other contexts, but it would be much safer if Aspose could be configured to produce an HTML document that did not contain any non-standard styles.
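Until such an option exists, removing the non-standard styles has to happen after export. The following is only a sketch under stated assumptions: it handles double-quoted inline style attributes, does not touch <style> blocks or @page rules, and the names MsoStyleStripper and Strip are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MsoStyleStripper
{
    // Assumption: inline styles are written as style="..." with double quotes.
    private static readonly Regex StyleAttribute = new Regex(
        @"\bstyle\s*=\s*""(?<css>[^""]*)""",
        RegexOptions.IgnoreCase);

    // An mso-* declaration at the start of the value or right after a semicolon.
    private static readonly Regex MsoDeclaration = new Regex(
        @"(?<=^|;)\s*mso-[\w-]+\s*:[^;]*;?",
        RegexOptions.IgnoreCase);

    // Removes every mso-* declaration from every inline style attribute.
    public static string Strip(string html)
    {
        return StyleAttribute.Replace(html, m =>
        {
            string css = MsoDeclaration.Replace(m.Groups["css"].Value, "").Trim();
            return "style=\"" + css + "\"";
        });
    }
}
```

An attribute that contained nothing but mso-* declarations is left as an empty style="", which is harmless but could be dropped in a further pass. The result will not round-trip back to Word, which is exactly the trade-off described above.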

Thank you. We’ll consider this improvement. But to be compliant with both XHTML Strict and CSS 2.1, a combination of options 2 and 3 would have to be applied. Since the objectives conflict, we need to think more about this. If the problem were easier, we would have already provided a solution.
A “master controller” option would make sense if the other options that use Microsoft “magic” were enabled by default. By our policy, the defaults favor compatibility, possibly at the cost of features. This also keeps the output HTML from being cluttered with features you don’t need in most cases.
Regards,