Conversion from Microsoft Word document to HTML - Headers and Footers issue

Hello…

Regarding the conversion of Microsoft Word documents to HTML format, I have the following problem:

In my Word document there are headers and footers. Once I save the document as HTML using Aspose.Words, these headers and footers are encapsulated in div tags, which is fine, but I have no way to distinguish them from the rest of the document, such as the body content.

Is there any way for the subsequent parsing of the HTML to figure out if the current div is a header or footer? And if not, is there any plan for future versions of Aspose.Words to include some sort of characteristic attribute for headers and footers so that a parser can filter them out correctly?

Here is an example of a part of the HTML code that Aspose.Words creates after the Word document is saved as an HTML file:

    <div class="awdiv awpage" style="width:595.35pt; height:842pt;">
    	<div class="awdiv" style="left:56.7pt; top:42.55pt; clip:rect(0pt,538.65pt,55.6pt,-56.7pt);">
    		<div class="awdiv" style="clip:rect(0pt,0pt,1pt,0pt);"/>
    		<div class="awdiv" style="left:-5.4pt;">
    			<div class="awdiv" style="left:324.35pt; clip:rect(0pt,175.1pt,12.35pt,0pt);">
    				<div class="awdiv" style="left:5.4pt;">
    					<span class="awspan awtext001" style="font-size:8pt; left:124.26pt; top:1.34pt;">Text in Header </span>
    				</div>
    			</div>
    			<div class="awdiv" style="top:11.35pt;">
    				<div class="awdiv" style="left:166.45pt; clip:rect(0pt,157.9pt,12.35pt,0pt);">
    					<div class="awdiv" style="left:5.4pt;">
    						<span class="awspan awtext002" style="font-size:10pt; left:59.66pt; top:0.33pt;">Text in Header </span>
    					</div>
    				</div>
    				<div class="awdiv" style="left:324.35pt; clip:rect(0pt,175.1pt,12.35pt,0pt);">
    					<div class="awdiv" style="left:5.4pt;">
    						<span class="awspan awtext001" style="font-size:8pt; left:79.36pt; top:1.34pt;">Text in Header </span>
    					</div>
    				</div>
    			</div>
    			<div class="awdiv" style="top:22.7pt;">
    				<div class="awdiv" style="left:324.35pt; clip:rect(0pt,175.1pt,12.35pt,0pt);">
    					<div class="awdiv" style="left:5.4pt;">
    						<span class="awspan awtext001" style="font-size:8pt; left:138.96pt; top:1.34pt;">Text in Header </span>
    					</div>
    				</div>
    			</div>
    		</div>
    		<div class="awdiv" style="top:45.4pt;"/>
    	</div>
    	<div class="awdiv" style="left:56.7pt; top:804.45pt; clip:rect(0pt,538.65pt,10.2pt,-56.7pt);">
    		<span class="awspan awtext001" style="font-size:8pt; left:445.75pt; top:0.26pt;">Text in Footer </span>
    		<span class="awspan awtext002" style="font-size:8pt; left:466.21pt; top:0.26pt;">Page 2 </span>
    		<span class="awspan awtext001" style="font-size:8pt; left:470.66pt; top:0.26pt;"> of </span>
    		<span class="awspan awtext002" style="font-size:8pt; left:488pt; top:0.26pt;">3 </span>
    	</div>
    	<div class="awdiv" style="left:56.7pt; top:97.15pt;">
    		<div class="awdiv" style="top:95.77pt;">
    			<span class="awspan awtext003" style="color:#212529; left:0pt; top:0pt;">Text on Page</span>
    		</div>
    	</div>
    </div>

@mbertram This is expected behavior. The HtmlFixed format is intended to preserve only the visual representation of the document. Unfortunately, it is not possible to preserve the original document structure when a document is saved to HtmlFixed format.
We will consider exporting the document structure to HtmlFixed format.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-24890

Thank you very much for your reply.

It would help me, with the current capabilities of Aspose.Words, to set a constant marker in the HTML code that I can use to uniquely identify the header and footer. Can you give me an example of how to do this with current capabilities?

@mbertram You can wrap the header/footer in bookmarks and then use the bookmarks as markers. For example, see the following code:

    Document doc = new Document();
    DocumentBuilder builder = new DocumentBuilder(doc);

    builder.MoveToHeaderFooter(HeaderFooterType.HeaderPrimary);
    BookmarkStart start = builder.StartBookmark("PrimaryHeader_Start");
    builder.EndBookmark(start.Name);
    builder.Write("This is my coooll primary header!");
    BookmarkStart end = builder.StartBookmark("PrimaryHeader_End");
    builder.EndBookmark(end.Name);

    builder.MoveToDocumentStart();
    builder.Write("This is the document's main body.");
    builder.InsertBreak(BreakType.PageBreak);
    builder.Write("This is the second page of the document.");

    HtmlFixedSaveOptions opt = new HtmlFixedSaveOptions();
    opt.ExportEmbeddedCss = true;
    opt.ExportEmbeddedFonts = true;
    opt.ExportEmbeddedSvg = true;
    opt.ExportEmbeddedImages = true;
    opt.PrettyFormat = true;

    doc.Save(@"C:\Temp\out.html", opt);

In this case, the header content will look like this:

    <div class="awdiv" style="left:72pt; top:36pt; clip:rect(0pt,540pt,14.8pt,-72pt);">
    	<a name="PrimaryHeader_Start" style="left:0pt; top:0pt;">
    	</a>
    	<span class="awspan awtext001" style="left:0pt; top:0.51pt; line-height:13.29pt;">This is my coooll primary header!</span>
    	<a name="PrimaryHeader_End" style="left:163.62pt; top:0pt;">
    	</a>
    </div>
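
On the parsing side, the marker anchors can be located in the saved HTML and the content between them extracted. The following is only a minimal sketch: the class name `HeaderMarkerDemo` and the method `TryExtractBetweenMarkers` are hypothetical, and it assumes the anchors are emitted as `<a name="...">` elements in document order, as in the output above. A regular expression is used only because the marker pattern is simple; a real HTML parser would be more robust.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderMarkerDemo
{
    // Finds the HTML between two marker anchors. Returns false when either
    // marker is missing or the end marker precedes the start marker.
    public static bool TryExtractBetweenMarkers(
        string html, string startName, string endName, out string inner)
    {
        inner = string.Empty;

        // Matches an empty anchor such as <a name="PrimaryHeader_Start" style="..."></a>.
        Match start = Regex.Match(html, @"<a\s+name=""" + Regex.Escape(startName) + @"""[^>]*>\s*</a>");
        Match end = Regex.Match(html, @"<a\s+name=""" + Regex.Escape(endName) + @"""[^>]*>\s*</a>");

        int from = start.Index + start.Length;
        if (!start.Success || !end.Success || end.Index < from)
            return false;

        inner = html.Substring(from, end.Index - from);
        return true;
    }

    public static void Main()
    {
        string html = @"<div class=""awdiv"">
	<a name=""PrimaryHeader_Start"" style=""left:0pt; top:0pt;"">
	</a>
	<span class=""awspan awtext001"">This is my coooll primary header!</span>
	<a name=""PrimaryHeader_End"" style=""left:163.62pt; top:0pt;"">
	</a>
</div>";

        if (TryExtractBetweenMarkers(html, "PrimaryHeader_Start", "PrimaryHeader_End", out string inner))
        {
            // Strip the remaining tags to get the plain header text.
            string text = Regex.Replace(inner, "<[^>]+>", "").Trim();
            Console.WriteLine(text); // prints: This is my coooll primary header!
        }
    }
}
```

Because HtmlFixed positions elements absolutely, the sketch deliberately fails (returns `false`) rather than guessing when the markers are not found in the expected order.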

The above approach is good, but my problem is that the bookmarks do not span across the entire header or footer. I would need to find a way to write right before the end of the header.

Jumping to the beginning is possible. Jumping to the end of the header does not work. So what I would like to achieve is to put the cursor at the end of the header.

So my question is: How can I move the cursor to the end of the header?

@mbertram You can use code like this to mark existing headers/footers in your document:

    Document doc = new Document(@"C:\Temp\in.docx");

    // Get all headers/footers in the document.
    NodeCollection headersFooters = doc.GetChildNodes(NodeType.HeaderFooter, true);

    // Wrap headers/footers into bookmarks.
    int counter = 0;
    foreach (HeaderFooter headerFooter in headersFooters)
    {
        string bkNamePrefix = string.Format("{0}_{1}_", headerFooter.HeaderFooterType, counter++);

        BookmarkStart startStart = new BookmarkStart(doc, bkNamePrefix + "start");
        BookmarkEnd startEnd = new BookmarkEnd(doc, startStart.Name);

        BookmarkStart endStart = new BookmarkStart(doc, bkNamePrefix + "end");
        BookmarkEnd endEnd = new BookmarkEnd(doc, endStart.Name);

        headerFooter.PrependChild(startEnd);
        headerFooter.PrependChild(startStart);

        headerFooter.AppendChild(endStart);
        headerFooter.AppendChild(endEnd);
    }

    HtmlFixedSaveOptions opt = new HtmlFixedSaveOptions();
    opt.ExportEmbeddedCss = true;
    opt.ExportEmbeddedFonts = true;
    opt.ExportEmbeddedSvg = true;
    opt.ExportEmbeddedImages = true;
    opt.PrettyFormat = true;

    doc.Save(@"C:\Temp\out.html", opt);
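
For the narrower question of placing the `DocumentBuilder` cursor at the end of a header, a possible sketch is shown below. It is not taken from the reply above: the class and method names are hypothetical, and it assumes that `DocumentBuilder.MoveTo(Node)` places the cursor at the end of a paragraph when given a `Paragraph`, and that the header ends with a paragraph.

```csharp
using Aspose.Words;

public static class HeaderCursorDemo
{
    // Appends a zero-length marker bookmark at the end of the primary header
    // of the first section. Returns false if there is nothing to mark.
    public static bool MarkEndOfPrimaryHeader(Document doc)
    {
        HeaderFooter header = doc.FirstSection.HeadersFooters[HeaderFooterType.HeaderPrimary];
        if (header == null || header.LastParagraph == null)
            return false; // No primary header, or the header has no paragraphs.

        DocumentBuilder builder = new DocumentBuilder(doc);

        // Assumption: moving to a Paragraph puts the cursor at its end.
        builder.MoveTo(header.LastParagraph);

        BookmarkStart end = builder.StartBookmark("PrimaryHeader_End");
        builder.EndBookmark(end.Name);
        return true;
    }
}
```

The bookmark name `PrimaryHeader_End` simply mirrors the earlier example; any unique name would do.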

Thanks for the example. I tested it with the following structure in the Word document:

The structure is as follows: At the top is the bookmark that marks the beginning of the header. In between is the paragraph with the header and at the end is the bookmark that marks the end of the header. The corresponding debug watch looks like this:

The HTML file, on the other hand, looks different than expected. The two bookmarks are rendered first and then the paragraph with the header follows. This means that the order in the HTML does not correspond to the order in the Word document:

This is exactly what leads to a problem in my code, which I can’t solve because I need to be able to rely on the expected order. It may be interesting to note that there are images in the header. Could this have an impact on rendering?

@mbertram Could you please attach your document here for testing? We will check it and provide you with more information.

Sure… here is the Word document:

header_minimal_example.docx (120.6 KB)

And this is the corresponding HTML file:

aspose.zip (24.9 KB)

@mbertram Thank you for the additional information. The problem occurs because the shape in your header is a floating shape placed behind the text. Unfortunately, as I mentioned earlier, the HtmlFixed format is intended to preserve only the visual representation of the document, and it is not possible to preserve the original document structure when a document is saved to HtmlFixed format.
I am afraid it will not be possible to mark headers/footers like the one in your example document using the current Aspose.Words capabilities when saving the document to HtmlFixed format. The proposed workaround will work only if the header/footer does not have floating content.
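
To find out in advance whether a given header or footer falls under this limitation, the shapes inside it can be inspected before saving. This is a rough sketch with hypothetical names (`FloatingContentCheck`, `HasFloatingShapes`); it assumes that `Shape.IsInline` is `false` for floating shapes, and it checks only `Shape` nodes, so group shapes or other floating content such as floating tables would need additional checks.

```csharp
using Aspose.Words;
using Aspose.Words.Drawing;

public static class FloatingContentCheck
{
    // Returns true if the header/footer contains at least one shape that is
    // not positioned inline with the text.
    public static bool HasFloatingShapes(HeaderFooter headerFooter)
    {
        // 'true' makes the search recursive, so shapes nested in paragraphs
        // and tables are included.
        foreach (Shape shape in headerFooter.GetChildNodes(NodeType.Shape, true))
        {
            if (!shape.IsInline)
                return true;
        }
        return false;
    }
}
```

A caller could use the result to decide whether the bookmark markers can be trusted for that header or footer.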

So, since I have no way to mark headers and footers, it would be nice if Aspose.Words would do this itself. All that would be needed is a small class attribute (which could be empty) appended to each object that comes from a header or footer.

It would also be nice if page numbers could be marked in the same way.

I see that there is an issue WORDSNET-24890 on this case. It would be great if my comments could be considered in this issue. For a future version of Aspose.Words this would be very important.

@mbertram Sure, I have added your comments to the defect description so they will be considered by the development team when work on this issue starts.

Regarding page numbers, you can mark them with bookmarks as I suggested earlier. Page numbers in MS Word documents are represented by { PAGE } fields, so you can insert one bookmark before the field and another after it.
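
A sketch of that idea follows. The class name, method name and the `PageNumber_N_start` / `PageNumber_N_end` naming scheme are hypothetical, chosen to match the header/footer example. It assumes `doc.Range.Fields` also covers fields inside headers and footers, and the floating-content limitation described above applies here as well.

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Fields;

public static class PageNumberMarkers
{
    // Surrounds every PAGE field with two zero-length marker bookmarks and
    // returns the number of fields that were marked.
    public static int MarkPageFields(Document doc)
    {
        // Take a snapshot first so the tree is not modified while the
        // field collection is being enumerated.
        List<Field> pageFields = new List<Field>();
        foreach (Field field in doc.Range.Fields)
        {
            if (field.Type == FieldType.FieldPage)
                pageFields.Add(field);
        }

        int counter = 0;
        foreach (Field field in pageFields)
        {
            string prefix = "PageNumber_" + counter++ + "_";

            BookmarkStart startStart = new BookmarkStart(doc, prefix + "start");
            BookmarkEnd startEnd = new BookmarkEnd(doc, startStart.Name);
            BookmarkStart endStart = new BookmarkStart(doc, prefix + "end");
            BookmarkEnd endEnd = new BookmarkEnd(doc, endStart.Name);

            // Marker immediately before the field start node.
            CompositeNode startParent = field.Start.ParentNode;
            startParent.InsertBefore(startStart, field.Start);
            startParent.InsertBefore(startEnd, field.Start);

            // Marker immediately after the field end node. Inserting the end
            // node first yields the order: field end, endStart, endEnd.
            CompositeNode endParent = field.End.ParentNode;
            endParent.InsertAfter(endEnd, field.End);
            endParent.InsertAfter(endStart, field.End);
        }
        return counter;
    }
}
```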