Free Support Forum - aspose.com

Replace Text in Word Document with HTML with Option to Select Formatting inside HTML or of Document Builder C# .NET

We’ve just upgraded from v5 to v11 and it’s not working the same as it used to.


We use Aspose.Words to do a simple merge using code as follows which as you can see is inserting HTML into the merge field:

public void FieldMerging(FieldMergingArgs Args)
{
// Insert the text for this merge field as HTML data, using DocumentBuilder.
DocumentBuilder builder = new DocumentBuilder(Args.Document);
builder.MoveToMergeField(Args.DocumentFieldName);
builder.InsertHtml((String)Args.FieldValue);

// The HTML text itself should not be inserted as we have already inserted it as an HTML.
Args.Text = "";
}

This has worked fine for years using v5 but using v11 it is screwing up the formatting in a number of ways.

I have attached the template, the document produced by v5 and the document produced by v11.

Could you please advise on how to get the format using v11 to be the same as it was using v5.

Thanks

Hi Dale,


Thanks for your inquiry. Could you please share your complete code here for testing, including the HTML snippet you want to insert in place of the merge field? I will investigate the issue on my side and provide you with more information.

Best Regards,
In fact in this example, while we use the insert HTML option, the HTML for the 10 merge fields are all straight text, the text you can see that has been inserted. Hence why we're confused that the formatting has changed, because we haven't supplied any.

doc = new Aspose.Words.Document(memStream, new LoadOptions(Aspose.Words.LoadFormat.Doc, null, ResourcePath)); // Stream, then LoadOptions(format, password, base URI)
doc.MailMerge.FieldMergingCallback = new DocumentMergeFieldHandler();

Hashtable Expansions = new Hashtable();

// Populate Expansions with HTML from the database (too complex to show here). It is stored as follows:
// Key = "[" + merge field name + "]", e.g. [addressee], and the value is the HTML, in this case: Test Dale-3 Burrell-Sansha

String[] Results = new String[MergeFields.Length];

for (int i = 0; i < MergeFields.Length; i++)
{
Results[i] = (Expansions.Contains("[" + MergeFields[i].ToLower() + "]") ? (String)Expansions["[" + MergeFields[i].ToLower() + "]"] : "");
}

doc.MailMerge.Execute(MergeFields, Results);

using (MemoryStream memStreamSave = new MemoryStream())
{
doc.Save(memStreamSave, Aspose.Words.SaveFormat.Doc);

// Write data to database
}

Hi Dale,


Thanks for the additional information.

Please note that the MergeField event occurs during mail merge when a simple mail merge field is encountered in the document. You can respond to this event to return HTML text for the mail merge engine to insert into the document. For example, please try using the following code snippet to be able to mail merge HTML data into a document:

// The same approach can be used when merging HTML data from a database.

public void MailMergeInsertHtml()

{

Document doc = new Document(MyDir + "MailMerge.InsertHtml.doc");

// Add a handler for the MergeField event.

doc.MailMerge.FieldMergingCallback = new HandleMergeFieldInsertHtml();

// Load some Html from file.

StreamReader sr = File.OpenText(MyDir + "MailMerge.HtmlData.html");

string htmltext = sr.ReadToEnd();

sr.Close();

// Execute mail merge.

doc.MailMerge.Execute(new string[] { "htmlField1" }, new string[] { htmltext });

// Save resulting document with a new name.

doc.Save(MyDir + "MailMerge.InsertHtml Out.doc");

}

private class HandleMergeFieldInsertHtml : IFieldMergingCallback

{

/// <summary>
/// This is called when merge field is actually merged with data in the document.
/// </summary>

void IFieldMergingCallback.FieldMerging(FieldMergingArgs e)

{

// All merge fields that expect HTML data should be marked with some prefix, e.g. 'html'.

if (e.DocumentFieldName.StartsWith("html"))

{

// Insert the text for this merge field as HTML data, using DocumentBuilder.

DocumentBuilder builder = new DocumentBuilder(e.Document);

builder.MoveToMergeField(e.DocumentFieldName);

builder.InsertHtml((string)e.FieldValue);

// The HTML text itself should not be inserted.

// We have already inserted it as an HTML.

e.Text = "";

}

}

void IFieldMergingCallback.ImageFieldMerging(ImageFieldMergingArgs e)

{

// Do nothing.

}

}


I hope, this will help.

Best Regards,

Thanks Awais, I must be missing something, because your code looks to be doing exactly the same thing as mine, which is corrupting the formatting. Any ideas?

If I was to guess what the issue is, my guess would be that when I try and insert HTML, but that HTML is a plain text string, your code wraps it in a <p> or <span> and then applies some default formatting to it. And either this is new behaviour since V5 or the default format has changed.


This is going to be a critical issue if we are unable to resolve it, as we must be able to maintain the original format of the template document when inserting merge fields, unless the merge field itself contains formatting which isn’t the case in this example.

I just did some further investigation, and it appears to be an issue with the HTML insert function. As best I can tell, when using “InsertHtml” it removes all formatting present in the document at that point and uses the default document formatting.


You can see in the header it’s removed the bold and underline. Every merge field has used Times New Roman 12pt, even though the main content is Times New Roman 10pt. Nothing is bold any more. And the footer has lost its italics, and had the font size increased.

Do you have any ideas for us?

As an aside, what is the performance going to be like using an event handler on every merge field, then creating a new document builder, then moving to the merge field? I’m a bit worried it’s going to be slow.

Hi Dale,


Thanks for the additional information. We are checking with this scenario and will get back to you as soon as possible.

Best Regards,

Hi Dale,


Thanks for your patience.
Dale:
If I was to guess what the issue is, my guess would be that when I try and insert HTML, but that HTML is a plain text string, your code wraps it in a <p> or <span> and then applies some default formatting to it. And either this is new behaviour since V5 or the default format has changed.

Your guess is right; when you insert a plain string using the DocumentBuilder.InsertHtml method, it wraps it inside <p> and then <span> tags as follows. Here, you can see some default formatting is also specified:


<p style="margin: 0pt">

<span style="font-family: 'Times New Roman'; font-size: 12pt">this is a plain text</span>

</p>

Dale:
I just did some further investigation, and it appears to be an issue with the HTML insert function. As best I can tell when using "InsertHtml" it removes all formatting present in the document at that point and uses the default document formatting.

You can see in the header it's removed the bold and underline. Every merge field has used Times New Roman 12pt, even though the main content is Times New Roman 10pt. Nothing is bold any more. And the footer has lost its italics, and had the font size increased.

Starting from Aspose.Words v9.5.0, the behaviour of InsertHtml was changed. Content inserted by InsertHtml no longer inherits the formatting specified in the DocumentBuilder options; all formatting is taken from the HTML snippet. If you insert HTML with no formatting specified, default formatting is used for the inserted content, e.g. if a font is not specified in your HTML snippet, the default font (Times New Roman) will be applied.
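
Since all formatting now comes from the snippet, one possible way to keep the surrounding font is to state it explicitly in the HTML before inserting it. The following is only a minimal sketch under that assumption: `WrapWithFont` and its parameters are hypothetical names, not part of Aspose.Words, and the helper only builds a string.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text;

public static class HtmlFormattingSketch
{
    // Hypothetical helper: wraps plain text in a <span> whose inline style
    // states the font explicitly, so nothing is left to default formatting.
    public static string WrapWithFont(string text, string fontName, double sizePt, bool bold, bool italic)
    {
        StringBuilder style = new StringBuilder();
        style.AppendFormat(CultureInfo.InvariantCulture,
            "font-family: '{0}'; font-size: {1}pt", fontName, sizePt);
        if (bold) style.Append("; font-weight: bold");
        if (italic) style.Append("; font-style: italic");

        // Encode the text so characters such as '<' and '&' are not read as markup.
        return "<span style=\"" + style + "\">" + WebUtility.HtmlEncode(text) + "</span>";
    }

    public static void Main()
    {
        // Prints: <span style="font-family: 'Times New Roman'; font-size: 10pt; font-weight: bold">Test Dale-3</span>
        Console.WriteLine(WrapWithFont("Test Dale-3", "Times New Roman", 10, true, false));
    }
}
```

In a merge handler the arguments would presumably come from the builder after it has moved to the merge field (for example its Font name, size, bold and italic values), and the result would be passed to InsertHtml. That wiring is an assumption and is not shown here. The sketch also does not escape quotes inside the font name.
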

Dale:
As an aside, what is the performance going to be like using an event handler on every merge field, then creating a new document builder, then moving to the merge field? I'm a bit worried it's going to be slow.

Please note that the FieldMergingCallback event occurs during mail merge whenever a mail merge field is encountered in the document. Inserting HTML from inside this event is definitely a costly operation. Creating a new document builder and moving directly to a merge field, on the other hand, is comparatively cheap.

Best Regards,

Thanks Awais,


I appreciate you clearing up why this behaviour happens, so the important answer I need from you is how do I continue to use Aspose.Words to do what I need now that you have changed the behaviour in this way?

Our document merging system which is extensively used by our clients depends on being able to insert arbitrary HTML, where that HTML can be a plain string, without any additional formatting being added.

Wrapping a string within a paragraph seems like wrong behaviour, as a string would never be a block element but a paragraph is. And reverting to default formatting also seems wrong, as sometimes we insert small snippets which should always have the formatting of their parent, as the snippet may be used in multiple circumstances with different formatting in each. For example, if we have one document in Arial and another in Times New Roman, then the merge field needs to display in the font of the document without having to set it for the merge field.

V5 of Aspose.Words did the job perfectly, maybe we could use V5 and V11 with V5 merging the document and V11 converting it to PDF - to be clear the only reason we upgraded to V11 was to be able to convert to PDF.

If it’s going to be too difficult then we’ll just have to return the upgrade.

Or would it maybe be possible for you to add back the old HTML insert function as a separate method so we can continue to use Aspose.Words?

And if you can see a way through this, could you provide a code snippet of your suggestion to speed the merge up? I assume you are implying a manual merge rather than using the automatic merge?

Thanks again,

Dale

Hi Dale,


Thanks for your patience. I think, you can use the following code snippet to be able to merge the formatting that is specified inside HTML string with the formatting of MergeField:

Document doc = new Document(@"C:\test\mf.docx");

DocumentBuilder builder = new DocumentBuilder(doc);

builder.MoveToMergeField("mf", false, false);

InsertHtmlWithBuilderFormatting(builder, "plain html string");

//InsertHtmlWithBuilderFormatting(builder, "formatted html string");

// Just to remove the mergefield

builder.MoveToMergeField("mf");

doc.Save(@"C:\test\out.docx");


public static void InsertHtmlWithBuilderFormatting(DocumentBuilder builder, string html)

{

ArrayList nodes = new ArrayList();

Document doc = builder.Document;

// Store any callback already set on this document

INodeChangingCallback origCallback = doc.NodeChangingCallback;

// Stores nodes inserted during the InsertHtml call.

doc.NodeChangingCallback = new HandleNodeChanging(nodes);

// Some properties may be changed during InsertHTML, try using a brand new builder instead.

DocumentBuilder htmlBuilder = new DocumentBuilder(doc);

// Move to current paragraph of the original builder

if (builder.CurrentParagraph != null)

htmlBuilder.MoveTo(builder.CurrentParagraph);

// Check if a specific inline node is selected move to this instead

if (builder.CurrentNode != null)

htmlBuilder.MoveTo(builder.CurrentNode);

// Insert HTML.

htmlBuilder.InsertHtml(html);

// Restore the original callback

doc.NodeChangingCallback = origCallback;

// Go through every inserted node and copy formatting from the DocumentBuilder to the appropriate nodes.

foreach (Node node in nodes)

{

if (node.NodeType == NodeType.Run)

{

Run run = (Run)node;

// Copy formatting of the builder's font to the font of the run.

CopyFormatting(builder.Font, run.Font, htmlBuilder.Font);

}

else if (node.NodeType == NodeType.Paragraph)

{

Paragraph para = (Paragraph)node;

// Copy formatting of the builder's paragraph and list formatting to the formatting of the paragraph.

CopyFormatting(builder.ParagraphFormat, para.ParagraphFormat, htmlBuilder.ParagraphFormat);

CopyFormatting(builder.ListFormat, para.ListFormat, htmlBuilder.ListFormat);

}

else if (node.NodeType == NodeType.Cell)

{

Cell cell = (Cell)node;

// Copy formatting of the builder's cell formatting to the cell.

CopyFormatting(builder.CellFormat, cell.CellFormat, htmlBuilder.CellFormat);

}

else if (node.NodeType == NodeType.Row)

{

Row row = (Row)node;

// Copy formatting of the builder's row formatting to the row

CopyFormatting(builder.RowFormat, row.RowFormat, htmlBuilder.RowFormat);

}

}

// Move the original builder to where the temporary builder ended up

if (htmlBuilder.CurrentParagraph != null)

builder.MoveTo(htmlBuilder.CurrentParagraph);

// Move to specific inline node if possible.

if (htmlBuilder.CurrentNode != null)

builder.MoveTo(htmlBuilder.CurrentNode);

}


public class HandleNodeChanging : INodeChangingCallback

{

ArrayList mNodes;

public HandleNodeChanging(ArrayList nodes)

{

mNodes = nodes;

}

void INodeChangingCallback.NodeInserted(NodeChangingArgs args)

{

mNodes.Add(args.Node);

}

void INodeChangingCallback.NodeInserting(NodeChangingArgs args)

{

// Do Nothing

}

void INodeChangingCallback.NodeRemoved(NodeChangingArgs args)

{

// Do Nothing

}

void INodeChangingCallback.NodeRemoving(NodeChangingArgs args)

{

// Do Nothing

}

}


public static void CopyFormatting(Object source, Object dest, Object compare)

{

if (source.GetType() != dest.GetType() || source.GetType() != compare.GetType())

throw new ArgumentException("All objects must be of the same type");

// Iterate through each property in the source object.

foreach (PropertyInfo prop in source.GetType().GetProperties())

{

// Skip indexed access items. Skip setting the internals of a style as these should not be changed.

if (prop.Name == "Item" || prop.Name == "Style")

continue;

object value;

// Wrap this call as it can throw an exception. Skip if thrown

try

{

value = prop.GetValue(source, null);

}

catch (Exception)

{

continue;

}

// Skip if value can not be retrieved.

if (value != null)

{

// If this property returns a class which belongs to the Aspose.Words assembly

if (value.GetType().IsClass && prop.GetGetMethod().ReturnType.Assembly.ManifestModule.Name == "Aspose.Words.dll")

{

// Recurse into this class.

CopyFormatting(prop.GetValue(source, null), prop.GetValue(dest, null), prop.GetValue(compare, null));

}

else if (prop.CanWrite)

{

// dest value != default dont copy

if (prop.GetValue(dest, null).Equals(prop.GetValue(compare, null)))

{

// If we can write to this property then copy the value across.

prop.SetValue(dest, prop.GetValue(source, null), null);

}

}

}

}

}


Moreover, I have attached the sample documents here for you to play with.

I hope, this will help.

Best Regards,

Hi Awais,


Thanks again for persisting with this. It does now work as expected, although there was one small error in your code below, which should read NOT equals :)

else if (prop.CanWrite)
{
    // dest value != default dont copy
    if (!prop.GetValue(dest, null).Equals(prop.GetValue(compare, null)))
    {
        // If we can write to this property then copy the value across.
        prop.SetValue(dest, prop.GetValue(source, null), null);
    }
}


Now while this works as expected I am concerned about the complexity of the code, and the performance of it. It strikes me that you could easily change the internals of Aspose.Words for a future release which would render this code unusable. And the fact that it uses a builder per merge field and reflection does make me wonder how well this will scale up to batch jobs of hundreds or thousands of documents.

I assume you have very few other customers using this feature? But if there was any chance you could encapsulate this into the Aspose.Words component with a single method call interface and written for performance that would give us peace of mind because we will be expanding our usage of this in future.

Cheers,

Dale

Hi Dale,


Thanks for the additional information.

I have logged a new feature request in our issue tracking system and asked our development team to consider providing an option to specify whether formatting is taken from the MergeField or from inside the HTML when inserting HTML content via the DocumentBuilder.InsertHtml method. The issue ID is WORDSNET-6726. Your request has been linked to this issue and you will be notified as soon as this feature is available.

Secondly, yes, the complexity of the above code is pretty high. If you don’t want to take the formatting specified in the HTML content at all when inserting HTML into your document, maybe you can use the following code snippet instead:

Document doc = new Document(@"C:\test\in.docx");

DocumentBuilder builder = new DocumentBuilder(doc);

builder.MoveToMergeField("testField");

string html = "html text with formatting";

MemoryStream htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(html));

Aspose.Words.LoadOptions options = new Aspose.Words.LoadOptions();

options.LoadFormat = LoadFormat.Html;

Document htmlDoc = new Document(htmlStream, options);

MemoryStream textStream = new MemoryStream();

htmlDoc.Save(textStream, SaveFormat.Text);

// ToArray returns only the bytes written; GetBuffer would also include unused buffer capacity.
builder.Write(Encoding.UTF8.GetString(textStream.ToArray()));

doc.Save(@"C:\test\out.docx");



Moreover, I have attached sample documents here with this post for you to play with.

I hope, this will help.

Best Regards,

Hi Dale,


Thanks for your inquiry.

You can find a much cleaner version of the code posted by Awais as an attachment to the following forum post. It does, however, still use reflection like the rough version posted here.

The reason why this functionality was changed in the first place and this work around exists is because there are many problems that arise during the combination of HTML formatting and the document formatting. This is the reason why it was decided to take the formatting from HTML exclusively.

While the workaround isn’t the best solution, it should still be reliable and fairly fast. We will consider adding an extra function to better mimic the old behavior of InsertHtml in a future version; however, we cannot provide an ETA at the moment. Your request has been logged in our tracking system.

My suggestion is to first check whether you really need to use InsertHtml at all. Some users use this method even when they are just inserting plain text. If this is the case then you can simply translate the call into DocumentBuilder.Writeln, which will inherit the formatting as you want.

Otherwise you will need to use the work around code for the time being. We apologise for any inconvenience.
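
To decide per field whether InsertHtml is needed at all, a rough check for markup could be used. This is only a heuristic sketch: `LooksLikeHtml` is a hypothetical helper, and the pattern (a tag or a character entity) is an assumption about what counts as HTML in the stored values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainTextCheck
{
    // Matches a tag such as <b>, </p> or <br/>, or a character entity
    // such as &amp; or &#160;. Anything else is treated as plain text.
    private static readonly Regex Markup = new Regex(
        @"</?[A-Za-z][^>]*>|&(#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);");

    public static bool LooksLikeHtml(string value)
    {
        return !string.IsNullOrEmpty(value) && Markup.IsMatch(value);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeHtml("Test Dale-3 Burrell-Sansha"));           // False
        Console.WriteLine(LooksLikeHtml("Hello, this is an <i>italic</i> test")); // True
    }
}
```

Values for which the check returns false could then be written as plain text so that they inherit the merge field formatting, and only the rest would go through InsertHtml with the workaround.
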

Thanks,

Thanks guys… I appreciate your patience and assistance with this issue.

Hi Guys,


I’ve had time to look at this in more detail and have a couple of questions to assist me going forward.

Q1) Using the code provided I assume this will always over-write all formatting applied within my merge data? As in every attribute will be copied whether set or not? e.g. if the outer formatting is Bold all the inner formatting will become bold, and if the outer formatting isn’t bold then the inner formatting will be all not-bold?

This may be very obvious, but wasn’t sure if that code only copies “set” attributes or just all of them.

Q2) Is there any way to tell that an attribute has been set vs whether it is a default? e.g. “<p>Test</p>” and “<p style="font-family: 'Times New Roman'">Test</p>” will both return the same font if the document font is also “times new roman”, but is there any way to detect that the first instance inherited it from the document default and the second was deliberately set?

Q3) I am considering creating my own merge field system for various reasons. Is it possible to search for a Regular Expression and get a collection of results. e.g. If I want to find every instance of “[anything-in-here]” so would match “[Name]”, “[Address]”, “[Item(ItemName)]”. Then once I have this collection how do I replace each, is there a “MoveToText” method? And then I guess the insert is the same as normal. And then remove the original text somehow?

Thanks again for all your assistance.

Cheers,

Dale

Hi Dale,


Thanks for your inquiry

1) This code will only copy over font formatting from the DocumentBuilder when there is no formatting for that member provided with the HTML. I’m afraid I’m not 100% sure what you meant by inner and outer formatting in this case. It would be great if you could clarify this.

2) I’m afraid there is no way to tell these two situations apart; both are loaded into the model in the same way and the font is reported as Times New Roman.

3) You can use Range.Replace to find and replace text. Please see the second code example under Find and Replace Overview in the documentation, which uses a Regex to match text.

If you find you need to have more control over how the text is replaced then you may need to use the replace evaluator overload as described here.
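
As an illustration of the kind of pattern involved, here is a small sketch that finds and fills “[anything-in-here]” placeholders in a plain string using only System.Text.RegularExpressions. The class and method names are hypothetical, and it does not touch a Word document; the same Regex would be the one handed to Range.Replace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderSketch
{
    // One '[', then one or more characters that are not brackets, then ']'.
    public static readonly Regex Placeholder = new Regex(@"\[([^\[\]]+)\]");

    // Returns the text between the brackets for every placeholder found.
    public static List<string> FindNames(string text)
    {
        List<string> names = new List<string>();
        foreach (Match m in Placeholder.Matches(text))
            names.Add(m.Groups[1].Value);
        return names;
    }

    // Replaces each placeholder whose lower-cased name is a key in values;
    // unknown placeholders are left untouched.
    public static string Fill(string text, IDictionary<string, string> values)
    {
        return Placeholder.Replace(text, delegate(Match m)
        {
            string value;
            return values.TryGetValue(m.Groups[1].Value.ToLowerInvariant(), out value)
                ? value
                : m.Value;
        });
    }

    public static void Main()
    {
        string text = "Dear [Name], [Address] [Item(ItemName)]";
        Console.WriteLine(string.Join(" | ", FindNames(text))); // Name | Address | Item(ItemName)

        Dictionary<string, string> values = new Dictionary<string, string>();
        values["name"] = "Dale";
        Console.WriteLine(Fill(text, values)); // Dear Dale, [Address] [Item(ItemName)]
    }
}
```
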

If we can help with anything else, please feel free to ask.

Thanks,

Thanks Adam, appreciate your quick response - especially since I’m just down the road so to speak :).


With regard to answer 1, I’m probably missing something, but you say it only copies the formatting if none exists for the HTML. But how can you tell that? The code only seems to check whether the formatting is different before applying itself. (I haven’t checked your neater solution yet though).

You can probably tell from my questions that ultimately I’m trying to work out if I can use the merge field formatting AND the inserted HTML formatting which would be the ideal. But if the copy is all or nothing then maybe I have to provide either all formatting or no formatting on my inserted HTML?

It’s hard to explain so let me give an example. If I have a merge field <> and I make that bold, I would expect any content I replace that merge field with to get the same font and font-size as the merge field and also be bold. The solution you have presented works because it copies whatever formatting the merge field has to every inserted node.

But what about if the content I insert is “Hello, this is an <i>italic</i> test”? Now I would expect the inserted content to be the same font, font-size, bold and the word italic to be italic.

But because we are copying all the formatting from the merge field, won’t it overwrite the italic?

Hi Dale,


You’re very welcome. I was admiring your user name by the way :)

Actually both code implementations compare against a third variable, which is the default formatting (this is easier to see in the cleaner version of the code, which I highly suggest you use). The compare parameter contains the default formatting of text inserted into the document. If the HTML formatting differs from the default then nothing is copied over. This means the builder and HTML formatting are combined and there shouldn’t be any formatting being overridden.

So this code should achieve what you are looking for (a combination of both formatting) and not an all or nothing scenario. In your example the output will be correct (the entire sentence bold and the one word italic).

Likewise, if you had the mergefield font color set to red and the HTML snippet had a green colored font, then the resulting inserted text would be green. If there is HTML formatting then it is not overridden by the builder formatting.
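
The rule described above can be sketched without Aspose.Words at all. In this minimal illustration `SimpleFont` is a hypothetical stand-in for a font object with just three attributes, and `Merge` applies the comparison per attribute: the builder value is used only where the HTML left the attribute at its default.

```csharp
using System;

// Hypothetical stand-in for a font: only three attributes, for illustration.
public class SimpleFont
{
    public bool Bold;
    public bool Italic;
    public string Color = "black";
}

public static class MergeSketch
{
    // Per attribute: if the HTML value equals the default, take the builder's
    // value; otherwise the HTML explicitly changed it, so keep the HTML value.
    public static SimpleFont Merge(SimpleFont builder, SimpleFont fromHtml, SimpleFont defaults)
    {
        SimpleFont result = new SimpleFont();
        result.Bold = fromHtml.Bold == defaults.Bold ? builder.Bold : fromHtml.Bold;
        result.Italic = fromHtml.Italic == defaults.Italic ? builder.Italic : fromHtml.Italic;
        result.Color = fromHtml.Color == defaults.Color ? builder.Color : fromHtml.Color;
        return result;
    }

    private static string Describe(SimpleFont f)
    {
        return f.Bold + " " + f.Italic + " " + f.Color;
    }

    public static void Main()
    {
        SimpleFont defaults = new SimpleFont();
        SimpleFont field = new SimpleFont { Bold = true, Color = "red" };

        // Bold red merge field, italic word from the HTML: bold, italic, red.
        Console.WriteLine(Describe(Merge(field, new SimpleFont { Italic = true }, defaults))); // True True red

        // Bold red merge field, green text from the HTML: green wins.
        Console.WriteLine(Describe(Merge(field, new SimpleFont { Color = "green" }, defaults))); // True False green
    }
}
```

One consequence of comparing against the default: an attribute that the HTML sets explicitly to the default value cannot be told apart from one that was never set, so the builder value is used in that case.
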

I do apologise that such functionality requires a workaround; we will take another crack at implementing this behavior properly in a future version. Hopefully from this discussion you have seen just how complex combining two sets of formatting can be and why we made this move in the first place :)

Thanks,