Replace Text in Word Document with HTML with Option to Select Formatting inside HTML or of Document Builder C# .NET

I just did some further investigation, and it appears to be an issue with the HTML insert function. As best I can tell when using “InsertHtml” it removes all formatting present in the document at that point and uses the default document formatting.

You can see in the header it’s removed the bold and underline. Every merge field has used Times New Roman 12pt, even though the main content is Times New Roman 10pt. Nothing is bold any more. And the footer has lost its italics, and had the font size increased.

Do you have any ideas for us?

As an aside, what is the performance going to be like using an event handler on every merge field, then creating a new document builder, then moving to the merge field? I’m a bit worried it’s going to be slow.

Hi Dale,

Thanks for the additional information. We are checking this scenario and will get back to you as soon as possible.

Best Regards,

Hi Dale,

Thanks for your patience.

Dale:
If I was to guess what the issue is, my guess would be that when I try and insert HTML, but that HTML is a plain text string, your code wraps it in a or
and then applies some default formatting to it. And either this is new behaviour since V5 or the default format has changed.

Your guess is right; when you insert a plain string using DocumentBuilder.InsertHtml method, it wraps it inside
and then tags as follows. Here, you can see some default formatting is also specified:

this is a plain text

Dale:
I just did some further investigation, and it appears to be an issue with the HTML insert function. As best I can tell when using “InsertHtml” it removes all formatting present in the document at that point and uses the default document formatting.
You can see in the header it’s removed the bold and underline. Every merge field has used Times New Roman 12pt, even though the main content is Times New Roman 10pt. Nothing is bold any more. And the footer has lost its italics, and had the font size increased.

Starting from Aspose.Words v9.5.0, the behaviour of InsertHtml was changed. Content inserted by InsertHtml no longer inherits the formatting specified in the DocumentBuilder options; all formatting is taken from the HTML snippet. If you insert HTML with no formatting specified, default formatting is used for the inserted content, e.g. if no font is specified in your HTML snippet, the default font (Times New Roman) is applied.

Dale:
As an aside, what is the performance going to be like using an event handler on every merge field, then creating a new document builder, then moving to the merge field? I’m a bit worried it’s going to be slow.

Please note that the FieldMergingCallback event occurs during mail merge when a mail merge field is encountered in the document. Inserting HTML from inside this event is definitely a costly operation. On the other hand, creating a new document builder and moving directly to a merge field is a comparatively less costly operation.

Best Regards,

Thanks Awais,

I appreciate you clearing up why this behaviour happens, so the important answer I need from you is how do I continue to use Aspose.Words to do what I need now that you have changed the behaviour in this way?

Our document merging system which is extensively used by our clients depends on being able to insert arbitrary HTML, where that HTML can be a plain string, without any additional formatting being added.

Wrapping a string within a paragraph seems wrong behaviour, as a string would never be a block element but a paragraph is. And reverting to default formatting also seems wrong, as sometimes we insert small snippets which should always have the formatting of their parent, as the snippet may be used in multiple circumstances with different formatting in each. For example, if we have one document in Arial and another in Times New Roman, then the merge field needs to display in the font of the document without having to set it for the merge field.

V5 of Aspose.Words did the job perfectly, maybe we could use V5 and V11 with V5 merging the document and V11 converting it to PDF - to be clear the only reason we upgraded to V11 was to be able to convert to PDF.

If it’s going to be too difficult then we’ll just have to return the upgrade.

Or would it maybe be possible for you to add back the old HTML insert function as a separate method so we can continue to use Aspose.Words?

And if you can see a way through this, could you provide a code snippet of your suggestion to speed the merge up? I assume you are implying a manual merge rather than using the automatic merge?

Thanks again,

Dale

Hi Dale,

Thanks for your patience. I think you can use the following code snippet to merge the formatting specified inside the HTML string with the formatting of the MergeField:

using System;
using System.Collections;
using System.Reflection;
using Aspose.Words;
using Aspose.Words.Tables;

Document doc = new Document(@"C:\test\mf.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
// Move to the merge field without deleting it (isAfter = false, isDeleteField = false).
builder.MoveToMergeField("mf", false, false);
InsertHtmlWithBuilderFormatting(builder, "plain html string");
// InsertHtmlWithBuilderFormatting(builder, "formatted html string");
// Just to remove the mergefield
builder.MoveToMergeField("mf");
doc.Save(@"C:\test\out.docx");

public static void InsertHtmlWithBuilderFormatting(DocumentBuilder builder, string html)
{
    ArrayList nodes = new ArrayList();
    Document doc = builder.Document;
    // Store any callback already set on this document
    INodeChangingCallback origCallback = doc.NodeChangingCallback;
    // Stores nodes inserted during the InsertHtml call.
    doc.NodeChangingCallback = new HandleNodeChanging(nodes);
    // Some properties may be changed during InsertHTML, try using a brand new builder instead.
    DocumentBuilder htmlBuilder = new DocumentBuilder(doc);
    // Move to current paragraph of the original builder
    if (builder.CurrentParagraph != null)
        htmlBuilder.MoveTo(builder.CurrentParagraph);
    // Check if a specific inline node is selected move to this instead
    if (builder.CurrentNode != null)
        htmlBuilder.MoveTo(builder.CurrentNode);
    // Insert HTML.
    htmlBuilder.InsertHtml(html);
    // Restore the original callback
    doc.NodeChangingCallback = origCallback;
    // Go through every inserted node and copy formatting from the DocumentBuilder to the appropriate nodes.
    foreach (Node node in nodes)
    {
        if (node.NodeType == NodeType.Run)
        {
            Run run = (Run)node;
            // Copy formatting of the builder's font to the font of the run.
            CopyFormatting(builder.Font, run.Font, htmlBuilder.Font);
        }
        else if (node.NodeType == NodeType.Paragraph)
        {
            Paragraph para = (Paragraph)node;
            // Copy formatting of the builder's paragraph and list formatting to the formatting of the paragraph.
            CopyFormatting(builder.ParagraphFormat, para.ParagraphFormat, htmlBuilder.ParagraphFormat);
            CopyFormatting(builder.ListFormat, para.ListFormat, htmlBuilder.ListFormat);
        }
        else if (node.NodeType == NodeType.Cell)
        {
            Cell cell = (Cell)node;
            // Copy formatting of the builder's cell formatting to the cell.
            CopyFormatting(builder.CellFormat, cell.CellFormat, htmlBuilder.CellFormat);
        }
        else if (node.NodeType == NodeType.Row)
        {
            Row row = (Row)node;
            // Copy formatting of the builder's row formatting to the row
            CopyFormatting(builder.RowFormat, row.RowFormat, htmlBuilder.RowFormat);
        }
    }
    // Move the original builder to where the temporary builder ended up
    if (htmlBuilder.CurrentParagraph != null)
        builder.MoveTo(htmlBuilder.CurrentParagraph);
    // Move to specific inline node if possible.
    if (htmlBuilder.CurrentNode != null)
        builder.MoveTo(htmlBuilder.CurrentNode);
}

public class HandleNodeChanging : INodeChangingCallback
{
    ArrayList mNodes;
    public HandleNodeChanging(ArrayList nodes)
    {
        mNodes = nodes;
    }
    void INodeChangingCallback.NodeInserted(NodeChangingArgs args)
    {
        mNodes.Add(args.Node);
    }
    void INodeChangingCallback.NodeInserting(NodeChangingArgs args)
    {
        // Do Nothing
    }
    void INodeChangingCallback.NodeRemoved(NodeChangingArgs args)
    {
        // Do Nothing
    }
    void INodeChangingCallback.NodeRemoving(NodeChangingArgs args)
    {
        // Do Nothing
    }
}

public static void CopyFormatting(Object source, Object dest, Object compare)
{
    if (source.GetType() != dest.GetType() || source.GetType() != compare.GetType())
        throw new ArgumentException("All objects must be of the same type");
    // Iterate through each property in the source object.
    foreach (PropertyInfo prop in source.GetType().GetProperties())
    {
        // Skip indexed access items. Skip setting the internals of a style as these should not be changed.
        if (prop.Name == "Item" || prop.Name == "Style")
            continue;
        object value;
        // Wrap this call as it can throw an exception. Skip if thrown
        try
        {
            value = prop.GetValue(source, null);
        }
        catch (Exception)
        {
            continue;
        }
        // Skip if value can not be retrieved.
        if (value != null)
        {
            // If this property returns a class which belongs to the Aspose.Words assembly
            if (value.GetType().IsClass && prop.GetGetMethod().ReturnType.Assembly.ManifestModule.Name == "Aspose.Words.dll")
            {
                // Recurse into this class.
                CopyFormatting(prop.GetValue(source, null), prop.GetValue(dest, null), prop.GetValue(compare, null));
            }
            else if (prop.CanWrite)
            {
                // Copy only if the destination still holds the default (compare) value,
                // i.e. the HTML did not set this property itself.
                if (object.Equals(prop.GetValue(dest, null), prop.GetValue(compare, null)))
                {
                    // If we can write to this property then copy the value across.
                    prop.SetValue(dest, prop.GetValue(source, null), null);
                }
            }
        }
    }
}

Moreover, I have attached the sample documents here for you to play with.

I hope this helps.

Best Regards,

Hi Awais,

Thanks again for persisting with this. It does now work as expected, although there was one small error in your code below, which should read NOT equals:

else if (prop.CanWrite)
{
    // dest value != default dont copy
    if (!prop.GetValue(dest, null).Equals(prop.GetValue(compare, null)))
    {
        // If we can write to this property then copy the value across.
        prop.SetValue(dest, prop.GetValue(source, null), null);
    }
}

Now while this works as expected I am concerned about the complexity of the code, and the performance of it. It strikes me that you could easily change the internals of Aspose.Words for a future release which would render this code unusable. And the fact that it uses a builder per merge field and reflection does make me wonder how well this will scale up to batch jobs of hundreds or thousands of documents.

I assume you have very few other customers using this feature? But if there was any chance you could encapsulate this into the Aspose.Words component with a single method call interface and written for performance that would give us peace of mind because we will be expanding our usage of this in future.

Cheers,

Dale

Hi Dale,

Thanks for the additional information.

I have logged a new feature request in our issue tracking system and asked our development team to consider providing an option to specify whether formatting is taken from the MergeField or from inside the HTML when inserting HTML content via the DocumentBuilder.InsertHtml method. The issue ID is WORDSNET-6726. Your request has been linked to this issue and you will be notified as soon as this feature is available.

Secondly, yes, the complexity of the above-mentioned code is pretty high. If you don’t want to take the formatting specified in the HTML content at all when inserting HTML into your document, maybe you can use the following code snippet instead:

using System.IO;
using System.Text;
using Aspose.Words;

Document doc = new Document(@"C:\test\in.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToMergeField("testField");
string html = "html text with formatting";
// Load the HTML into a temporary document.
MemoryStream htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(html));
Aspose.Words.LoadOptions options = new Aspose.Words.LoadOptions();
options.LoadFormat = LoadFormat.Html;
Document htmlDoc = new Document(htmlStream, options);
// Save it as plain text to discard all HTML formatting.
MemoryStream textStream = new MemoryStream();
htmlDoc.Save(textStream, SaveFormat.Text);
// ToArray returns only the bytes written; GetBuffer would also include unused buffer capacity.
builder.Write(Encoding.UTF8.GetString(textStream.ToArray()));
doc.Save(@"C:\test\out.docx");
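
When a MemoryStream is decoded back into a string, as in the snippet above, it matters which accessor supplies the bytes: GetBuffer() exposes the whole internal buffer, including capacity that was never written, while ToArray() returns only the bytes written. The following is a minimal sketch of that difference using plain .NET types; the class and method names (StreamTextDemo, WriteSample, ReadWritten, ReadWholeBuffer) are made up for illustration and are not part of Aspose.Words.

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative helper only; none of these names come from Aspose.Words.
public static class StreamTextDemo
{
    // Writes the UTF-8 bytes of 'text' into a new, growable MemoryStream.
    public static MemoryStream WriteSample(string text)
    {
        MemoryStream stream = new MemoryStream();
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
        return stream;
    }

    // Decodes only the bytes that were actually written to the stream.
    public static string ReadWritten(MemoryStream stream)
    {
        return Encoding.UTF8.GetString(stream.ToArray());
    }

    // Decodes the whole internal buffer, which may contain unused
    // capacity (zero bytes) after the written data.
    public static string ReadWholeBuffer(MemoryStream stream)
    {
        return Encoding.UTF8.GetString(stream.GetBuffer());
    }
}
```

With a default-constructed stream, ReadWholeBuffer will typically return the text followed by a run of '\0' characters, which would then be written into the document as well; ReadWritten avoids that.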

Moreover, I have attached sample documents here with this post for you to play with.

I hope this helps.

Best Regards,

Hi Dale,

Thanks for your inquiry.

You can find a much cleaner version of the code posted by Awais as an attachment to the following forum post. It does, however, still use reflection like the rough version posted here.

The reason this functionality was changed in the first place, and why this workaround exists, is that many problems arise when HTML formatting is combined with document formatting. This is why it was decided to take the formatting from HTML exclusively.

While the workaround isn’t the best solution, it should still be reliable and fairly fast. We will consider adding an extra function to better mimic the old behavior of InsertHtml in a future version; however, we cannot provide an ETA at the moment. Your request has been logged in our tracking system.

My suggestion is to first check whether you really need to use InsertHtml at all. Some users use this method even when they are just inserting plain text. If this is the case then you can simply translate the call into DocumentBuilder.Writeln or DocumentBuilder.Write, which will inherit the formatting as you want.

Otherwise you will need to use the work around code for the time being. We apologise for any inconvenience.

Thanks,

Thanks guys… I appreciate your patience and assistance with this issue.

Hi Guys,

I’ve had time to look at this in more detail and have a couple of questions to assist me going forward.

Q1) Using the code provided I assume this will always over-write all formatting applied within my merge data? As in every attribute will be copied whether set or not? e.g. if the outer formatting is Bold all the inner formatting will become bold, and if the outer formatting isn’t bold then the inner formatting will be all not-bold?

This may be very obvious, but wasn’t sure if that code only copies “set” attributes or just all of them.

Q2) Is there any way to tell that an attribute has been set vs whether it is a default. e.g.
"
Test
" and "
Test
" will both return the same font if the document font is also “times new roman” but is there any way to detect that the first instance inherited it from the document default and the second was deliberately set.

Q3) I am considering creating my own merge field system for various reasons. Is it possible to search for a Regular Expression and get a collection of results. e.g. If I want to find every instance of “[anything-in-here]” so would match “[Name]”, “[Address]”, “[Item(ItemName)]”. Then once I have this collection how do I replace each, is there a “MoveToText” method? And then I guess the insert is the same as normal. And then remove the original text somehow?

Thanks again for all your assistance.

Cheers,

Dale

Hi Dale,

Thanks for your inquiry.

1) This code only copies font formatting over from the DocumentBuilder when the HTML provides no formatting for that member. I’m afraid I’m not 100% sure what you meant by inner and outer formatting in this case. It would be great if you could clarify this.

2) I’m afraid there is no way to tell these two situations apart; both are loaded into the model in the same way and both report the font as Times New Roman.

3) You can use Range.Replace to find and replace text. Please see the second code example under Find and Replace Overview in the documentation which uses a Regex to match text.

If you find you need to have more control over how the text is replaced then you may need to use the replace evaluator overload as described here.
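
As a rough illustration of the matching side only, here is a small sketch that uses the standard .NET Regex class on a plain string rather than on a document. The pattern `\[.+?\]` is an assumption based on the “[anything-in-here]” examples in the question, and PlaceholderFinder, FindAll and ReplaceAll are hypothetical names; inside Aspose.Words the same Regex would be passed to Range.Replace instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative helper only; it works on plain strings, not on Aspose.Words nodes.
public static class PlaceholderFinder
{
    // Lazy match: an opening bracket, one or more characters,
    // then the nearest closing bracket.
    private static readonly Regex Placeholder = new Regex(@"\[.+?\]");

    // Returns every placeholder found in the text, brackets included.
    public static List<string> FindAll(string text)
    {
        List<string> found = new List<string>();
        foreach (Match match in Placeholder.Matches(text))
            found.Add(match.Value);
        return found;
    }

    // Replaces each placeholder with whatever 'resolve' returns for the
    // name between the brackets.
    public static string ReplaceAll(string text, Func<string, string> resolve)
    {
        return Placeholder.Replace(
            text,
            match => resolve(match.Value.Substring(1, match.Value.Length - 2)));
    }
}
```

For the text "Dear [Name], your [Item(ItemName)] ships to [Address]." FindAll returns the three bracketed placeholders in order, because the lazy quantifier stops at the first closing bracket.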

If we can help with anything else, please feel free to ask.

Thanks,

Thanks Adam, appreciate your quick response - especially since I’m just down the road so to speak :).

With regard to answer 1, I’m probably missing something, but you say it only copies the formatting if none exists for the HTML. But how can you tell that? The code only seems to check whether the formatting is different before applying itself. (I haven’t checked your neater solution yet though).

You can probably tell from my questions that ultimately I’m trying to work out if I can use the merge field formatting AND the inserted HTML formatting which would be the ideal. But if the copy is all or nothing then maybe I have to provide either all formatting or no formatting on my inserted HTML?

It’s hard to explain so let me give an example. If I have a merge field <> and I make that bold, I would expect any content I replace that merge field with to get the same font and font size as the merge field and also be bold. The solution you have presented works because it copies whatever formatting the merge field has to every inserted node.

But what about if the content I insert is
Hello, this is an italic test
. Now I would expect the inserted content to be the same font, font-size, bold and the word italic to be italic.

But because we are copying all the formatting from the merge field, won’t it overwrite the italic?

Hi Dale,

You’re very welcome. I was admiring your user name by the way.

Actually both code implementations compare a third variable, this is the default formatting (this is easier to see in the cleaner version of the code which I highly suggest to use). The compare parameter contains the default formatting of text inserted into the document. If the HTML formatting differs from the default then nothing is copied over. This means the builder and HTML formatting is combined and there shouldn’t be any formatting being overridden.

So this code should achieve what you are looking for (a combination of both formatting) and not an all or nothing scenario. In your example the output will be correct (the entire sentence bold and the one word italic).

Likewise, if you had the mergefield font color set to red and the HTML snippet had green colored font then the resulting inserted text would be green. If there is HTML formatting then it is not overridden by the builder formatting.
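
The three-way comparison described here can be sketched without Aspose.Words at all. In the sketch below, FontSettings is a hypothetical stand-in for a font object and CopyWhereDefault is a simplified, non-recursive version of the CopyFormatting idea: a property is copied from the builder only when the inserted value still equals the default. It is an illustration of the rule under those assumptions, not the library's implementation.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-in for a font object; not an Aspose.Words type.
public class FontSettings
{
    public string Name { get; set; } = "Times New Roman";
    public double Size { get; set; } = 12;
    public bool Bold { get; set; }
    public bool Italic { get; set; }
    public string Color { get; set; } = "Black";
}

public static class FormattingMerge
{
    // Copies a property from 'source' (the builder formatting) to 'dest'
    // (the inserted content) only when 'dest' still holds the value found
    // in 'defaults', i.e. the HTML did not change that property.
    public static void CopyWhereDefault(object source, object dest, object defaults)
    {
        if (source.GetType() != dest.GetType() || source.GetType() != defaults.GetType())
            throw new ArgumentException("All objects must be of the same type");

        foreach (PropertyInfo prop in source.GetType().GetProperties())
        {
            if (!prop.CanRead || !prop.CanWrite)
                continue;

            object destValue = prop.GetValue(dest, null);
            object defaultValue = prop.GetValue(defaults, null);

            // object.Equals handles nulls and compares boxed values by value.
            if (object.Equals(destValue, defaultValue))
                prop.SetValue(dest, prop.GetValue(source, null), null);
        }
    }
}
```

With a builder set to bold red Arial 10pt and inserted content that is italic and green, the merged result is bold, italic, green Arial 10pt. The same rule also means that a value set explicitly in the HTML but equal to the default is indistinguishable from an unset one and is replaced by the builder value.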

I do apologise that such functionality requires a workaround; we will take another crack at implementing such behavior properly in a future version. Hopefully from this discussion you have seen just how complex combining two sets of formatting can be and why we made this move in the first place.

Thanks,

Thanks again, yeah my username was funny/clever when I was younger and had just moved to the UK. Now being back in NZ it’s actually a little embarrassing but hey ho.

Thanks for being so patient with this issue. I do understand the complexity, I’ve written similar code myself in the past. Thinking back when we started with V5.5 I don’t think you used any of the HTML formatting at all, so things have definitely improved. And while the solution is a bit messy at least you’re sticking with us.

Now one (hopefully) final question. I think I understand how you are doing the compare to see whether the formatting was different to the default. I guess then that leaves one small hole, where I might be intentionally setting the formatting of the inserted text to the same formatting as the default, and in such a case it would then look like the default and be overwritten?

So for example Document default: Times New Roman
Inserted Test
Test
Merge field formatting: Arial

In this case the inserted formatting appears to be the default, will therefore change to Arial if I am not mistaken?

I’m sure I can work with this, just wanting to be clear.

CoolKiwiBloke:
Thanks again for persisting with this. It does now work as expected, although there was one small error in your code below which should read NOT equals

else if (prop.CanWrite)
{
    // dest value != default dont copy
    if (!prop.GetValue(dest, null).Equals(prop.GetValue(compare, null)))
    {
        // If we can write to this property then copy the value across.
        prop.SetValue(dest, prop.GetValue(source, null), null);
    }
}

Cheers,

Dale

Ignore my code change - the compare was correct as it stood, it just needed the try/catch in case it failed.

aske012:
Hi Dale,

Thanks for your inquiry

3) You can use Range.Replace to find and replace text. Please see the second code example under Find and Replace Overview in the documentation which uses a Regex to match text.

If you find you need to have more control over how the text is replaced then you may need to use the replace evaluator overload as described here.

If we can help with anything else, please feel free to ask.

Thanks,

I’ve just had a shot at this, and it’s working fine. Of course, what I need to be able to do is insert HTML using the mechanisms we’ve discussed. I have written some code that works, but I don’t think it’s very safe as it makes a few assumptions about the node provided being a run. However, if it’s not a run I wouldn’t know how to handle things. Could you please run your eye over this code and suggest any changes to make it more robust and do things in the way you would normally do them?

DocumentBuilder builder = new DocumentBuilder(doc);
doc.Range.Replace(new Regex(@"\[.+?\]"), new MyTestReplaceEvaluator(builder), false);

public class MyTestReplaceEvaluator : IReplacingCallback
{
    private DocumentBuilder Builder { set; get; }

    public MyTestReplaceEvaluator(DocumentBuilder builder)
    {
        Builder = builder;
    }

    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        string MergeFieldData = "*Test*";
        Node currentNode = e.MatchNode;
        if (e.MatchOffset > 0)
        {
            SplitRun(e.MatchNode as Run, e.MatchOffset);
            currentNode = e.MatchNode.NextSibling;
        }
        Builder.MoveTo(currentNode); // Insert before merge field
        new DocumentBuilderHelper(Builder).InsertHtmlWithBuilderFormatting(MergeFieldData);
        (currentNode as Run).Text = (currentNode as Run).Text.Replace(e.Match.ToString(), "");
        return ReplaceAction.Skip; // We've handled it ourselves
    }

    private Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}

Hi Dale,

Thanks for this additional information.

I believe you are correct in your assumption that the Arial formatting will override the default Times New Roman in that particular case. From memory, this may well have been how InsertHtml worked in Aspose.Words 8.X before the changes to the way the formatting is applied; however, I cannot be 100% sure of that. Hopefully this doesn’t cause too much of a problem.

Regarding your replace code, it looks pretty good. You should only ever expect Run nodes to be sent to the callback so it should work fine. However, note that your matching text may be made up of more than one run, which would mean you may need to use the more elaborate way of splitting runs as found in the example code. This depends on your expected input documents though.
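
To make the multi-run case concrete, here is a small sketch that models runs as plain strings and works out which of them a match touches when their texts are concatenated. RunSpan and RunsCoveredBy are hypothetical names and nothing here calls Aspose.Words; it only illustrates why a single placeholder can straddle several runs.

```csharp
using System;
using System.Collections.Generic;

// Illustrative helper only; runs are modelled as plain strings.
public static class RunSpan
{
    // Returns the indexes of the runs that overlap a match starting at
    // 'start' and spanning 'length' characters of the concatenated run texts.
    public static List<int> RunsCoveredBy(IList<string> runTexts, int start, int length)
    {
        List<int> covered = new List<int>();
        int end = start + length;   // exclusive end of the match
        int runStart = 0;           // offset of the current run in the concatenated text

        for (int i = 0; i < runTexts.Count; i++)
        {
            int runEnd = runStart + runTexts[i].Length;
            // Two half-open ranges overlap when each starts before the other ends.
            if (runStart < end && runEnd > start)
                covered.Add(i);
            runStart = runEnd;
        }
        return covered;
    }
}
```

If the runs are "Dear [Na", "me]" and ", welcome", the placeholder "[Name]" starts at offset 5 with length 6 and covers the first two runs, so removing it means trimming the tail of one run and the head of the next rather than editing a single run.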

Please let me know if I can help with anything else.

Thanks,

Thanks again, that should keep me busy for a while now.

The issues you have found earlier (filed as WORDSNET-6726) have been fixed in this .NET update and this Java update.
