Save DOC File to HTML | Extract only Headers Footers from Word Document into Separate HTML Files using C# .NET

manjunath.patil · October 2, 2020, 8:44am

Hi

I am evaluating the Aspose library for converting DOC files to HTML in C#. I need to extract only the header and footer into separate HTML files. Is it possible to extract only the header/footer into either a DOC file or an HTML file?
Note: I need not only the header/footer text but also the exact placement of text, images, and any other fields.

Thanks in advance for any help

awais.hafeez · October 2, 2020, 1:44pm

@manjunath.patil,

The following C# code, using the Aspose.Words for .NET API, extracts the header and footer contents from all sections of a Word document into one new document and then saves that document as both a DOC file and an HTML file:

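using Aspose.Words;

Document doc = new Document(@"C:\temp\in.docx");

// Start from a childless copy of the source so document-level settings and styles carry over.
Document header_Footer_Document = (Document)doc.Clone(false);
header_Footer_Document.RemoveAllChildren();
// Create the minimal structure: one section with a body and an empty paragraph.
header_Footer_Document.EnsureMinimum();
DocumentBuilder documentBuilder = new DocumentBuilder(header_Footer_Document);

foreach (Section sec in doc.Sections)
{
    foreach (HeaderFooter headerFooter in sec.HeadersFooters)
    {
        // Copy every paragraph and table of this header/footer into the body of the new document.
        foreach (Node node in headerFooter.ChildNodes)
            header_Footer_Document.LastSection.Body.AppendChild(header_Footer_Document.ImportNode(node, true));
    }

    // Start a new section so each source section's headers/footers stay grouped together.
    documentBuilder.MoveToDocumentEnd();
    documentBuilder.InsertBreak(BreakType.SectionBreakNewPage);
}

// The output format is inferred from the file extension.
header_Footer_Document.Save(@"C:\temp\output.doc");
header_Footer_Document.Save(@"C:\temp\output.html");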
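
The code above collects headers and footers into one document. If they are needed in separate files, as the question asks, one possible variation is to route each HeaderFooter node to its own target document based on the HeaderFooter.IsHeader property. This is only a sketch built on the same calls as the code above; the class name, the CreateEmptyCopy helper and the output paths are hypothetical, and it has not been run against a sample document:

```csharp
using Aspose.Words;

class SplitHeadersFooters
{
    static void Main()
    {
        // Same input path as in the code above.
        Document doc = new Document(@"C:\temp\in.docx");

        // One target document for headers and one for footers.
        Document headers = CreateEmptyCopy(doc);
        Document footers = CreateEmptyCopy(doc);

        foreach (Section sec in doc.Sections)
        {
            foreach (HeaderFooter headerFooter in sec.HeadersFooters)
            {
                // IsHeader is true for header nodes and false for footer nodes.
                Document target = headerFooter.IsHeader ? headers : footers;

                // Import each paragraph/table and append it to the target body.
                foreach (Node node in headerFooter.ChildNodes)
                    target.LastSection.Body.AppendChild(target.ImportNode(node, true));
            }
        }

        // Hypothetical output paths; the format follows the file extension.
        headers.Save(@"C:\temp\headers.html");
        footers.Save(@"C:\temp\footers.html");
    }

    // Hypothetical helper: a childless copy of the source with the minimal
    // section/body/paragraph structure, as in the code above.
    static Document CreateEmptyCopy(Document source)
    {
        Document copy = (Document)source.Clone(false);
        copy.RemoveAllChildren();
        copy.EnsureMinimum();
        return copy;
    }
}
```

Note that the content ends up in the body of each new document, so page-relative positioning from the original header or footer area may not carry over exactly to the HTML output.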

manjunath.patil · October 5, 2020, 12:49pm

Hi Awais Hafeez

Thanks a lot for the code; it works fine for extracting the header and footer.
I also need to extract the body into a separate HTML file. Is that possible?
That is, given a DOC file, I should be able to generate three separate files (header, body, footer).
Thanks in advance for the help.

With regards,
Manjunath

awais.hafeez · October 5, 2020, 4:44pm

@manjunath.patil,

Yes, you can use the following C# code, based on the Aspose.Words for .NET API, to extract the body contents from all sections of a Word document into one new document and then save that document as both a DOC file and an HTML file:

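using Aspose.Words;

Document doc = new Document(@"C:\temp\in.docx");

// Start from a childless copy of the source so document-level settings and styles carry over.
Document body_Document = (Document)doc.Clone(false);
body_Document.RemoveAllChildren();
// Create the minimal structure: one section with a body and an empty paragraph.
body_Document.EnsureMinimum();
DocumentBuilder documentBuilder = new DocumentBuilder(body_Document);

foreach (Section sec in doc.Sections)
{
    // Copy every body-level node (paragraphs, tables) of this section; headers and footers are skipped.
    foreach (Node node in sec.Body.ChildNodes)
        body_Document.LastSection.Body.AppendChild(body_Document.ImportNode(node, true));

    // Start a new section so the source section boundaries are preserved.
    documentBuilder.MoveToDocumentEnd();
    documentBuilder.InsertBreak(BreakType.SectionBreakNewPage);
}

// The output format is inferred from the file extension.
body_Document.Save(@"C:\temp\output.doc");
body_Document.Save(@"C:\temp\output.html");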

manjunath.patil · October 6, 2020, 2:26pm

Thank you very much, it's working.

manjunath.patil · October 14, 2020, 2:33pm

Dear Hafeez

Please let me know how I can remove the page number from the footer.

With regards,
Manjunath

awais.hafeez · October 15, 2020, 4:35am

@manjunath.patil,

The following code removes all PAGE and NUMPAGES fields from a Word document, including those in headers and footers:

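using Aspose.Words;
using Aspose.Words.Fields;

Document doc = new Document("C:\\temp\\input.docx");

// doc.Range covers the whole document, so fields in the body, headers and footers are all visited.
foreach (Field field in doc.Range.Fields)
{
    // PAGE shows the current page number; NUMPAGES shows the total page count.
    if (field.Type == FieldType.FieldPage || field.Type == FieldType.FieldNumPages)
        field.Remove();
}

doc.Save("C:\\temp\\20.10.docx");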

If the problem persists, please ZIP and upload your sample Word document together with an expected document showing the desired result, so that we can test it here. You can create the expected document manually in MS Word. We will then investigate the issue on our end and provide you with more information.
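
If page numbers should be removed only from footers, leaving PAGE fields elsewhere untouched, a possible variation is to walk the HeaderFooter nodes and use each footer's own Range. The following is a sketch under that assumption rather than a tested solution; the class name and output path are hypothetical, and the fields are copied into a list first so that removing one does not affect the collection being enumerated:

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Fields;

class RemoveFooterPageFields
{
    static void Main()
    {
        // Same input path as in the code above.
        Document doc = new Document("C:\\temp\\input.docx");

        foreach (Section sec in doc.Sections)
        {
            foreach (HeaderFooter headerFooter in sec.HeadersFooters)
            {
                // Skip headers; only footers are processed.
                if (headerFooter.IsHeader)
                    continue;

                // Snapshot the footer's fields before removing any of them.
                List<Field> fields = new List<Field>();
                foreach (Field field in headerFooter.Range.Fields)
                    fields.Add(field);

                foreach (Field field in fields)
                {
                    if (field.Type == FieldType.FieldPage || field.Type == FieldType.FieldNumPages)
                        field.Remove();
                }
            }
        }

        // Hypothetical output path.
        doc.Save("C:\\temp\\footer-page-fields-removed.docx");
    }
}
```

Any literal text around the field, such as "Page" or "of", is ordinary run text and would remain in the footer after the fields are removed.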

awais.hafeez · October 20, 2020, 5:11am

A post was split to a new topic: Preserve Text Alignment in Table Cell during Word DOT to HTML Conversion using C# .NET

awais.hafeez · October 21, 2020, 3:59am

A post was split to a new topic: Calculate Header Footer Height or Distance in Word Document using C# .NET