(Paid users) DOCXToHTML, style problem after Extract Content Between Nodes in a Document

DiZheng · November 7, 2022, 3:03am

Hi,
After extracting content between nodes in a document, the CSS styles in the HTML output are wrong: the bottom margin of the p and table tags is changed to 0pt.
Reference document: Extract Content Between Document Nodes | Aspose.Words for .NET

public void Get()
{
    try
    {
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.ExportRoundtripInformation = true;
        options.ExportImagesAsBase64 = true;
        options.CssStyleSheetType = CssStyleSheetType.External;
        // Open the DOCX as a binary stream; a text-oriented StreamReader is not needed.
        Stream stream = File.OpenRead("C://wordtohtml//demo.docx");
        LoadOptions loadOptions = new LoadOptions
        {
            WarningCallback = new DocumentLoadingWarningCallback(_logger)
        };
        Document doc = new Document(stream, loadOptions);
        doc.Save("C://wordtohtml//demo.html", options);
        //Close the stream now, it is no longer needed because the document is in memory.
        stream.Close();
        ArrayList head2list = new ArrayList();
        var heading2 = doc
            .GetChildNodes(NodeType.Paragraph, true)
            .Cast<Paragraph>()
            .ToArray()
            .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading2);

        foreach (var head2 in heading2)
        {
            head2list.Add(head2);
        }
        // get extractedNodes
        List<Node> pprList = _asposeService.ExtractContent((Node)head2list[4], (Node)head2list[5], false);
        Document pprDoc = _asposeService.GenerateDocument(doc, pprList);
        pprDoc.Save("C://wordtohtml//ppr.html", options);
    }
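    catch (Exception e)
    {
        // NOTE: the exception is swallowed here, which hides load and save failures; log or rethrow it in real code.
    }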
}
public List<Node> ExtractContent(Node startNode, Node endNode, bool isInclusive)
{
    // First, check that the nodes passed to this method are valid for use.
    VerifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    List<Node> nodes = new List<Node>();

    // If either marker is part of a comment, including the comment itself, we need to move the pointer
    // forward to the Comment Node found after the CommentRangeEnd node.
    if (endNode.NodeType == NodeType.CommentRangeEnd && isInclusive)
    {
        Node node = FindNextNode(NodeType.Comment, endNode.NextSibling);
        if (node != null)
            endNode = node;
    }

    // Keep a record of the original nodes passed to this method to split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // We will split the first and last nodes' content, depending if the marker nodes are inline.
    startNode = GetAncestorInBody(startNode);
    endNode = GetAncestorInBody(endNode);

    bool isExtracting = true;
    bool isStartingNode = true;
    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block-level nodes and specifically split the first
    // and last nodes when needed, so paragraph formatting is retained.
    // Method is a little more complicated than a regular extractor as we need to factor
    // in extracting using inline nodes, fields, bookmarks, etc. to make it useful.
    while (isExtracting)
    {
        // Clone the current node and its children to obtain a copy.
        Node cloneNode = currNode.Clone(true);
        bool isEndingNode = currNode.Equals(endNode);

        if (isStartingNode || isEndingNode)
        {
            // We need to process each marker separately, so pass it off to a separate method instead.
            // End should be processed at first to keep node indexes.
            if (isEndingNode)
            {
                // !isStartingNode: don't add the node twice if the markers are the same node.
                ProcessMarker(cloneNode, nodes, originalEndNode, currNode, isInclusive,
                    false, !isStartingNode, false);
                isExtracting = false;
            }

            // The conditional needs to be separate because the block-level start and end markers may be the same node.
            if (isStartingNode)
            {
                ProcessMarker(cloneNode, nodes, originalStartNode, currNode, isInclusive,
                    true, true, false);
                isStartingNode = false;
            }
        }
        else
            // Node is not a start or end marker, simply add the copy to the list.
            nodes.Add(cloneNode);

        // Move to the next node and extract it. If the next node is null,
        // the rest of the content is found in a different section.
        if (currNode.NextSibling == null && isExtracting)
        {
            // Move to the next section.
            Section nextSection = (Section)currNode.GetAncestor(NodeType.Section).NextSibling;
            currNode = nextSection.Body.FirstChild;
        }
        else
        {
            // Move to the next node in the body.
            currNode = currNode.NextSibling;
        }
    }

    // For compatibility with mode with inline bookmarks, add the next paragraph (empty).
    if (isInclusive && originalEndNode == endNode && !originalEndNode.IsComposite)
        IncludeNextParagraph(endNode, nodes);

    // Return the nodes between the node markers.
    return nodes;
}
public Document GenerateDocument(Document srcDoc, List<Node> nodes)
{
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.FirstSection.Body.RemoveAllChildren();

    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    return dstDoc;
}

DiZheng · November 7, 2022, 3:05am

demo.docx (30.1 KB)
This is the DOCX file I use.

xujinyang · November 7, 2022, 3:11am

I have the same issue as well…

alexey.noskov · November 7, 2022, 7:42am

@xujinyang @DiZheng The problem occurs because your code uses a new, blank document as the target for the extracted content, so the styles in the destination document differ from those in the source document. To resolve the problem, clone the original document and use the clone as the target document:

public static Document GenerateDocument(Document srcDoc, List<Node> nodes)
{
    // Clone source document to preserve source styles.
    Document dstDoc = (Document)srcDoc.Clone(false);

    // Import each node from the list into the new document. The destination is a clone of the source,
    // so its styles match the source styles and UseDestinationStyles preserves the formatting.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.UseDestinationStyles);

    // Put the section from the source document to retain original section page setup.
    dstDoc.AppendChild(importer.ImportNode(srcDoc.LastSection, true));

    // Remove all children from the imported section.
    dstDoc.FirstSection.Body.RemoveAllChildren();

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    return dstDoc;
}
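
To see whether the style difference described above applies to your document, a small diagnostic along these lines may help. It is only a sketch: the StyleDiagnostics class and its method names are made up for illustration, and the Normal style's "space after" value is just one place where the difference might show up, not necessarily the setting responsible for the 0pt margin.

```csharp
using System;
using Aspose.Words;

// Hypothetical helper for illustration; not part of Aspose.Words.
public static class StyleDiagnostics
{
    // Returns the "space after" value, in points, of the built-in Normal style.
    public static double NormalSpaceAfter(Document doc)
    {
        return doc.Styles[StyleIdentifier.Normal].ParagraphFormat.SpaceAfter;
    }

    // Prints the value for the source document, a clone of it, and a blank document.
    public static void Compare(Document srcDoc)
    {
        Document clone = (Document)srcDoc.Clone(false); // keeps the source styles, drops the content
        Document blank = new Document();                // carries only the library's default styles

        Console.WriteLine("source: " + NormalSpaceAfter(srcDoc));
        Console.WriteLine("clone:  " + NormalSpaceAfter(clone));
        Console.WriteLine("blank:  " + NormalSpaceAfter(blank));
    }
}
```

If the blank document prints a different value from the source and the clone, that would be consistent with the explanation above.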

DiZheng · November 7, 2022, 9:49am

@alexey.noskov Thank you very much for your help! Our problem has been solved.

DiZheng · November 8, 2022, 3:24am

Hi @alexey.noskov

Do you know how to set the font of the converted HTML to 'Graphic Regular', or how to set a single font, 'Graphic Regular', for the whole DOCX document?

alexey.noskov · November 8, 2022, 5:56am

@DiZheng You can use DocumentVisitor to achieve this. For example, see the following code:

Document doc = new Document(@"C:\Temp\in.docx");
FontChanger changer = new FontChanger("Graphic Regular");
doc.Accept(changer);
doc.Save(@"C:\Temp\out.html");

class FontChanger : DocumentVisitor
{
    public FontChanger(string fontName)
    {
        mFontName = fontName;
    }

    /// <summary>
    /// Called when a FieldEnd node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFieldEnd(FieldEnd fieldEnd)
    {
        // Simply change the font name.
        ResetFont(fieldEnd.Font);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a FieldSeparator node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFieldSeparator(FieldSeparator fieldSeparator)
    {
        ResetFont(fieldSeparator.Font);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a FieldStart node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFieldStart(FieldStart fieldStart)
    {
        ResetFont(fieldStart.Font);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a Footnote end is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFootnoteEnd(Footnote footnote)
    {
        ResetFont(footnote.Font);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a FormField node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFormField(FormField formField)
    {
        ResetFont(formField.Font);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a Paragraph end is encountered in the document.
    /// </summary>
    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        ResetFont(paragraph.ParagraphBreakFont);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a Run node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitRun(Run run)
    {
        ResetFont(run.Font);
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a SpecialChar is encountered in the document.
    /// </summary>
    public override VisitorAction VisitSpecialChar(SpecialChar specialChar)
    {
        ResetFont(specialChar.Font);
        return VisitorAction.Continue;
    }

    private void ResetFont(Aspose.Words.Font font)
    {
        font.Name = mFontName;
    }

    private string mFontName = "Arial";
}
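
If only ordinary text runs matter to you, a shorter variant might be enough. This is a sketch under that assumption, not a replacement for the visitor above: the RunFontSetter class is a hypothetical name, and unlike FontChanger it does not touch field characters, form fields, footnote references, special characters, or paragraph break formatting.

```csharp
using Aspose.Words;

// Hypothetical helper for illustration; not part of Aspose.Words.
public static class RunFontSetter
{
    // Sets the font name on every Run in the document and returns the number of runs changed.
    public static int SetRunFont(Document doc, string fontName)
    {
        int count = 0;

        // The second argument (true) makes the search cover the whole document tree.
        foreach (Run run in doc.GetChildNodes(NodeType.Run, true))
        {
            run.Font.Name = fontName;
            count++;
        }

        return count;
    }
}
```

Calling RunFontSetter.SetRunFont(doc, "Graphic Regular") before doc.Save would apply the font name to the run text only.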

DiZheng · November 9, 2022, 10:19am

Hi @alexey.noskov
I used the method you provided earlier, but there is a problem with the bookmark names in the document after converting it to HTML.
The bookmark name Bookmark3 was changed to Bookmark3_0:
"<p><a name="Bookmark3_0"><span>1.XXXXXXX</span></a></p>"

This function produces Bookmark3_0:

public static Document GenerateDocument(Document srcDoc, List<Node> nodes)
{
    // Clone source document to preserve source styles.
    Document dstDoc = (Document)srcDoc.Clone(false);

    // Import each node from the list into the new document. The destination is a clone of the source,
    // so its styles match the source styles and UseDestinationStyles preserves the formatting.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.UseDestinationStyles);

    // Put the section from the source document to retain original section page setup.
    dstDoc.AppendChild(importer.ImportNode(srcDoc.LastSection, true));

    // Remove all children from the imported section.
    dstDoc.FirstSection.Body.RemoveAllChildren();

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    return dstDoc;
}

This function produces Bookmark3, which is correct, but the bottom margin is changed to 0pt:

public Document GenerateDocument(Document srcDoc, ArrayList nodes)
{
    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.FirstSection.Body.RemoveAllChildren();

    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

alexey.noskov · November 9, 2022, 1:49pm

@DiZheng The problem occurs because MS Word does not allow duplicate bookmark names, and NodeImporter tries to fix this by renaming bookmarks. In the GenerateDocument method the bookmark is added twice: the first time in this line, when the whole section is imported and added:

dstDoc.AppendChild(importer.ImportNode(srcDoc.LastSection, true));

and the second time when the particular node is imported and added.

You can fix this by importing the section without its children. Please try modifying your code as follows:

public static Document GenerateDocument(Document srcDoc, List<Node> nodes)
{
    // Clone source document to preserve source styles.
    Document dstDoc = (Document)srcDoc.Clone(false);

    // Import each node from the list into the new document. The destination is a clone of the source,
    // so its styles match the source styles and UseDestinationStyles preserves the formatting.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.UseDestinationStyles);

    // Import the last section of the source document without its children to retain the original page setup.
    dstDoc.AppendChild(importer.ImportNode(srcDoc.LastSection, false));

    // Add an empty body.
    dstDoc.FirstSection.AppendChild(new Body(dstDoc));

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    return dstDoc;
}
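
To check whether any bookmark was renamed in the generated document, you could list the bookmark names before saving. This is only a sketch: the BookmarkDiagnostics class is a hypothetical name used for illustration.

```csharp
using System.Collections.Generic;
using Aspose.Words;

// Hypothetical helper for illustration; not part of Aspose.Words.
public static class BookmarkDiagnostics
{
    // Collects the names of all bookmarks in the document, in document order.
    public static List<string> BookmarkNames(Document doc)
    {
        List<string> names = new List<string>();

        foreach (Bookmark bookmark in doc.Range.Bookmarks)
            names.Add(bookmark.Name);

        return names;
    }
}
```

A name such as Bookmark3_0 in the returned list would suggest that the same bookmark was imported more than once.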

DiZheng · November 10, 2022, 1:43am

Thanks @alexey.noskov, I'll try this method.