Font size is not retained in HTML extracted from DOCX files

I am extracting sections from a Word document (DOCX) as HTML using Aspose.Words in C#. While iterating through the nodes in a document section, I need to save the collected content as HTML. Although the font size in my document is 11pt, the extracted HTML uses 12pt. I want to retain the same formatting as in the input document.

// Load the document
Document doc = new Document(filename);
doc.UpdateFields();
// Create an HTML save options object to configure the HTML output
HtmlSaveOptions saveOptions = new HtmlSaveOptions
{
    ExportHeadersFootersMode = ExportHeadersFootersMode.None,
    PrettyFormat = true,
    ExportImagesAsBase64 = true
};

foreach (Node node in section1.Body)
{
    if (node == pointer1)
        isCollecting = true;

    if (isCollecting)
    {
        if (firstNodeCollected)
        {
            if (node.NodeType == NodeType.Paragraph)
            {
                Paragraph paragraph = (Paragraph)node;
                foreach (Bookmark bookmarkStart in paragraph.Range.Bookmarks)
                {
                    var bookmarktext = bookmarkStart.Text.Trim();
                    if (bookmarktext.Contains("\r"))
                    {
                        string[] lines = bookmarktext.Split(new[] { "\r", "\n", "\r\n" }, StringSplitOptions.None);
                        bookmarktext = lines[lines.Length - 1];
                    }
                    var templist = lst.Where(x => x.Name == bookmarktext).ToList();
                    if (bookmarkStart.Name.StartsWith("_Toc") && bookmarkStart.BookmarkStart != bm2.BookmarkStart && HeaderText != bookmarkStart.Text.Trim() && templist.Count != 0)
                    {
                        isCollecting = false;
                        break;
                    }
                }

                if (!isCollecting)
                    break;
            }
            if (node.NodeType == NodeType.Paragraph)
            {
                Paragraph para = (Paragraph)node;
                ReplaceTextInRuns(para);
            }
            else if (node.NodeType == NodeType.Table)
            {
                Table table = (Table)node;
                foreach (Row row in table.Rows)
                {
                    foreach (Cell cell in row.Cells)
                    {
                        foreach (Paragraph para in cell.Paragraphs)
                        {
                            ReplaceTextInRuns(para);
                        }
                    }
                }
            }
            tempSection.Body.AppendChild(tempDoc.ImportNode(node, true));
        }
        else
        {
            firstNodeCollected = true;
        }
    }
}

string htmlContent;
using (MemoryStream htmlStream = new MemoryStream())
{
    tempDoc.Save(htmlStream, saveOptions);
    htmlContent = Encoding.UTF8.GetString(htmlStream.ToArray());
}

@suresh1995

To retain the original font size when extracting HTML from a DOCX file with Aspose.Words in C#, you can configure the HtmlSaveOptions class. The default settings may not preserve the font size, so you might need to enable additional options so that the extracted HTML reflects the original formatting.

Here’s a code snippet that demonstrates how to set up the HtmlSaveOptions to potentially resolve the issue:

// Load the document
Document doc = new Document(filename);

// Update fields in the document
doc.UpdateFields();

// Create an HTML save options object
HtmlSaveOptions saveOptions = new HtmlSaveOptions
{
    ExportHeadersFootersMode = ExportHeadersFootersMode.None,
    PrettyFormat = true,
    ExportImagesAsBase64 = true,
    ExportFontResources = true, // Ensure font resources are exported
    ResolveFontNames = true // Resolve font names to maintain formatting
};

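// Save the document to HTML format
string htmlContent;
using (MemoryStream htmlStream = new MemoryStream())
{
    doc.Save(htmlStream, saveOptions);
    // Assigned to a variable declared outside the using block so the HTML stays available afterwards
    htmlContent = Encoding.UTF8.GetString(htmlStream.ToArray());
}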

In this example, the ExportFontResources option is set to true, which exports the fonts used in the document as resources alongside the HTML. Additionally, the ResolveFontNames option resolves font family names so that the HTML refers to fonts that are actually available.

If you continue to experience issues with font sizes not matching, consider checking the original document for any styles or formatting that might be affecting the output. You may also want to explore other properties of HtmlSaveOptions that could influence the rendering of text.

For further details, you can refer to the official examples for [HtmlSaveOptions](https://github.com/aspose-words/Aspose.Words-for-.NET/tree/master/Examples/DocsExamples/DocsExamples/File%20Formats%20and%20Conversions/Save%20Options/Working%20with%20HtmlSaveOptions.cs).

If the problem persists, please provide additional details about the document structure or any specific styles that might be causing the discrepancy.

@suresh1995 Could you please attach your input and output documents here for our reference? We will check the issue and provide you with more information. Also, to convert a section to HTML, you can simply use the following code:

Document doc = new Document(@"C:\Temp\in.doc");
// Get section.
Section sect = doc.FirstSection;
// Convert section to HTML
string html = sect.ToString(SaveFormat.Html);

Font file cannot be written to disk. When saving to a stream or to a string either FontsFolder should be specified, or custom streams should be provided via FontSavingCallback, or ExportFontsAsBase64 should be set to true. Please see documentation for details.

@suresh1995 You should do exactly what is described in the exception message. Please try using the following code:

Document doc = new Document(@"C:\Temp\in.doc");
// Get section.
Section sect = doc.FirstSection;
// Convert section to HTML
HtmlSaveOptions opt = new HtmlSaveOptions();
opt.ExportFontsAsBase64 = true;
opt.ExportImagesAsBase64 = true;
string html = sect.ToString(opt);

Sharing a basic test doc:
test doc.docx (12.3 KB)

@suresh1995 Thank you for the additional information. The problem is not reproducible using the following simple code:

Document doc = new Document(@"C:\Temp\in.docx");
doc.Save(@"C:\Temp\out.html", new HtmlSaveOptions() { PrettyFormat = true });

The problem on your side might occur because you are creating tempDoc from scratch. For example, you can reproduce the problem using the following code:

Document doc = new Document(@"C:\Temp\in.docx");
Document tmpDoc = new Document();
tmpDoc.RemoveAllChildren();
tmpDoc.AppendChild(tmpDoc.ImportNode(doc.FirstSection, true));
tmpDoc.Save(@"C:\Temp\out.html", new HtmlSaveOptions() { PrettyFormat = true });
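
One plausible explanation, which is an assumption and not confirmed in this thread: a document created with new Document() carries its own default styles, and ImportNode without an explicit import mode resolves formatting against the destination's styles. If the blank document's defaults differ from the source's, text that relies on those defaults can change size. A minimal diagnostic sketch to compare the two (the class name FontDefaultsCheck is hypothetical, and the path is the sample path used above):

```csharp
using System;
using Aspose.Words;

public static class FontDefaultsCheck
{
    public static void Main()
    {
        Document doc = new Document(@"C:\Temp\in.docx");
        // A document created from scratch, as in the reproduction code above.
        Document blank = new Document();

        // Document-wide default font size of each document.
        Console.WriteLine("Source default font size: " + doc.Styles.DefaultFont.Size);
        Console.WriteLine("Blank default font size:  " + blank.Styles.DefaultFont.Size);

        // Font size of the Normal style in each document.
        Console.WriteLine("Source Normal style size: " + doc.Styles[StyleIdentifier.Normal].Font.Size);
        Console.WriteLine("Blank Normal style size:  " + blank.Styles[StyleIdentifier.Normal].Font.Size);
    }
}
```

If the printed values differ between the two documents, that would be consistent with the size change seen in the output HTML.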

To avoid this, you can clone the original document instead of creating a document from scratch:

Document doc = new Document(@"C:\Temp\in.docx");
Document tmpDoc = (Document)doc.Clone(false);
tmpDoc.RemoveAllChildren();
tmpDoc.AppendChild(tmpDoc.ImportNode(doc.FirstSection, true));
tmpDoc.Save(@"C:\Temp\out.html", new HtmlSaveOptions() { PrettyFormat = true });
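
Applied to the node-collecting loop from the original question, the same idea could look like the sketch below. It assumes that tempDoc and tempSection in the question were built from a blank document; the class SectionHtmlExporter and its method names are hypothetical helpers introduced only for illustration.

```csharp
using System.IO;
using System.Text;
using Aspose.Words;
using Aspose.Words.Saving;

public static class SectionHtmlExporter
{
    // Hypothetical helper: builds an empty container document that keeps the
    // styles and default formatting of the source document.
    public static Document CreateEmptyContainer(Document source)
    {
        // Shallow clone: copies document-level settings and styles, but no content.
        Document tempDoc = (Document)source.Clone(false);
        tempDoc.RemoveAllChildren();

        // Add one empty section with a body to receive the imported nodes.
        Section tempSection = new Section(tempDoc);
        tempDoc.AppendChild(tempSection);
        tempSection.AppendChild(new Body(tempDoc));
        return tempDoc;
    }

    // Hypothetical helper: saves the container to an HTML string, as in the question.
    public static string SaveAsHtml(Document tempDoc, HtmlSaveOptions saveOptions)
    {
        using (MemoryStream htmlStream = new MemoryStream())
        {
            tempDoc.Save(htmlStream, saveOptions);
            return Encoding.UTF8.GetString(htmlStream.ToArray());
        }
    }
}
```

With these helpers, the loop from the question stays unchanged: obtain tempDoc from SectionHtmlExporter.CreateEmptyContainer(doc), use tempDoc.FirstSection as tempSection, keep appending nodes with tempSection.Body.AppendChild(tempDoc.ImportNode(node, true)), and call SaveAsHtml at the end.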