Hello!
Thank you for your thoughtful post and your patience. I’m Viktor, the developer responsible for HTML export. Here are some considerations regarding the issue with section breaks.
We generally pay attention to XHTML compliance and other compatibility issues. This is especially important when exporting to the IDPF EPUB format. Please note that HTML output with SaveOptions.HtmlExportXhtmlTransitional is transitional XHTML, not strict XHTML. Let’s consider the problems we face and think about possible solutions.
There are four types of breaks we need to export:
- line break;
- page break;
- section break;
- column break.
The first two are fully compliant in both Aspose.Words and Microsoft Word HTML export. A line break is simply <br /> (no attributes) inside a paragraph (<p>). This is perfectly legal even in strict XHTML and should be well understood by all user agents. A page break is also fine, since it occurs inside <p> and doesn’t have any Microsoft-specific attributes. In rare cases people would like to replace line breaks with paragraph starts or suppress page breaks, but these are very minor requirements. Let’s leave both as they are output now.
Section breaks are not so easy. First of all, sections are represented in the output HTML as <div> elements, and breaks should occur on the same outer level, as children of <body>. Microsoft Word’s output is much like that of Aspose.Words, except that it also encloses <br> in a <span>. Neither choice is strict XHTML compliant. As you noted, we could emit a dummy paragraph (<p>) to enclose the break and remain compliant. But even with explicit zero margins, a paragraph is not good for this purpose, if only because it’s a block-level element. In IDPF EPUB export we met the same problem, since it requires strict XHTML, so we needed to output section breaks in a way that stays compliant. We removed the <br> element altogether and put its CSS properties directly on the subsequent <div> element. In theory this should be treated as a break, but Microsoft Word up to version 2007 doesn’t recognize such breaks:
…
<div>
<p style="margin:0pt"><span style="font-family:serif; font-size:12pt">Section 1</span></p>
</div>
<div style="page-break-before:always; clear:both; mso-break-type:section-break">
<p style="margin:0pt"><span style="font-family:serif; font-size:12pt">Section 2</span></p>
</div>
…
This construction can be considered for both export and import. But since Aspose.Words is primarily oriented toward processing Microsoft Word documents, we need to think about a bidirectional roundtrip with that application. Of course we cannot roundtrip every feature, but such things are always in focus.
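To make the EPUB-style construction concrete, here is a minimal Python sketch of the transformation as a post-processing step. It is not Aspose.Words code. The function name `fold_breaks_into_divs` is hypothetical, and the input shape (a standalone <br> with a style attribute placed directly before a <div>) is an assumption based on the description above. A regular expression is only adequate for such simplified markup; a real tool should use an HTML parser.

```python
import re

# Assumed input shape: a section-break <br> with a style attribute,
# immediately followed by the <div> of the next section.
BREAK_BEFORE_DIV = re.compile(
    r'<br\s+style="(?P<style>[^"]*)"\s*/?>\s*<div(?P<attrs>[^>]*)>',
    re.IGNORECASE,
)


def fold_breaks_into_divs(html: str) -> str:
    """Drop each <br> that precedes a <div> and move its style onto that <div>."""

    def merge(match: re.Match) -> str:
        break_style = match.group("style").strip().rstrip(";")
        attrs = match.group("attrs")
        existing = re.search(r'style="([^"]*)"', attrs)
        if existing:
            # The <div> already has a style: put the break properties first.
            combined = break_style + "; " + existing.group(1).strip()
            attrs = (
                attrs[: existing.start()]
                + 'style="' + combined + '"'
                + attrs[existing.end():]
            )
        else:
            attrs = attrs + ' style="' + break_style + '"'
        return "<div" + attrs + ">"

    return BREAK_BEFORE_DIV.sub(merge, html)


sample = (
    '<div><p>Section 1</p></div>'
    '<br style="page-break-before:always; clear:both; '
    'mso-break-type:section-break" />'
    '<div><p>Section 2</p></div>'
)
print(fold_breaks_into_divs(sample))
```

The result contains no element between the two <div> elements, which is what keeps the markup valid under strict XHTML. The price is the one already mentioned: Microsoft Word up to version 2007 does not recognize the break.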
Another issue you pointed out is the use of Microsoft Word-specific properties such as mso-break-type and mso-column-break-before to export section breaks and column breaks. Our general policy is to avoid Microsoft “magic” wherever possible. Most people don’t like it in their HTML. In some cases we cannot roundtrip features without a huge “magic” overhead and simply sacrifice the roundtrip for the sake of HTML purity. In other cases we provide options, for instance SaveOptions.HtmlExportDocumentProperties, to enable some “magic”. By default all these options are disabled. I agree with you that Microsoft-specific “magic” should be optional in HTML and disabled by default. But section breaks and column breaks probably need more than just Boolean options, which could bring some complexity.
Strictly speaking, non-standard CSS properties should not be a problem. The CSS specification states that readers (user agents) must ignore any unknown constructs as long as their general syntax is valid. In particular, any unknown property must be ignored. This is intended to allow extensions, including forward compatibility. Please share with us how you check CSS style sheets and why you need this check.
How can we output section breaks according to our discussion? Here are the choices:
- As we do now or as Microsoft Word does: non-compliant but recognized by Microsoft Word.
- As we do in IDPF EPUB export, which is the reverse: strict XHTML but not recognized by Microsoft Word.
- Filtering “magic”: either of the above but without “mso-break-type:section-break”. This turns a section break from a new page into a simple page break. The sections themselves are preserved.
- As page breaks for section breaks from a new page, ignoring all other section breaks. In this case it’s reasonable to join all sections in the output. (The question is what to do with page setup if it’s also output via SaveOptions.HtmlExportPageSetup.)
- Always ignore. Joining sections is reasonable here too, though debatable.
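The “filtering” choice can be illustrated with a short Python sketch that removes every declaration whose property name starts with mso- from an inline style value. This is an illustration only, not how Aspose.Words works internally. The helper name `strip_mso_declarations` is hypothetical, and splitting on semicolons assumes that no value contains a semicolon inside a quoted string or url().

```python
def strip_mso_declarations(style: str) -> str:
    """Return the inline style without Microsoft-specific (mso-*) declarations."""
    kept = []
    for declaration in style.split(";"):
        declaration = declaration.strip()
        if not declaration:
            continue  # skip empty pieces such as the one after a trailing ";"
        name = declaration.split(":", 1)[0].strip().lower()
        if name.startswith("mso-"):
            continue  # drop the Microsoft "magic"
        kept.append(declaration)
    return "; ".join(kept)


style = "page-break-before:always; clear:both; mso-break-type:section-break"
print(strip_mso_declarations(style))
```

With mso-break-type removed, only page-break-before:always and clear:both remain, so the section break degrades to a plain page break, as described in the third choice.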
For column breaks we can limit ourselves to one Boolean option. Currently column breaks are output, but neither the section properties nor the page setup say how many columns a particular section should have. A Boolean option should probably enable both the column count and all column breaks. If page setup is also requested, it should optionally include the following property:
@page Section1
{
…
mso-columns:2 even 36.0pt;
…
}
div.Section1
{page:Section1;}
This is not implemented, so any column break currently has page break semantics. We’ll consider full support in the future.
As you can see, there are quite a few degrees of freedom and unanswered questions. That’s why we haven’t done this parameterization yet. Moreover, we would likely have to provide our own roundtrip for every possible output, which would bring unjustified complexity. It’s easier to have only one algorithm for export and import. That’s how things are done in Microsoft Word: there is only one export option, which filters out all the “magic”.
Please share any ideas on how we can deal with breaks. Your feedback is very important to us. Thank you in advance!
Regards,