Document.Range.Replace is incredibly slow

DTS_Barry · November 23, 2010, 2:16pm

We are looking at Aspose.Word to handle our document production and conversion to PDF. Our requirement is to replace tokens contained in a Word document with content (sometimes Rich/HTML) from a database. These replacement tokens are in the format “{r*}”. Here is the RegEx required to find the tokens:

Regex(@"\{r(?<code>.*?)?\}", RegexOptions.Multiline);
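As a quick illustration of what this pattern captures, here is a minimal, self-contained sketch that runs it over a plain string rather than a Word document. The class name `TokenRegexDemo`, the method `ExtractCodes`, and the sample sentence are hypothetical. The group name `code` is taken from the `e.Match.Groups["code"]` lookup in the code posted later in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenRegexDemo
{
    // Same pattern as above: the named group "code" captures whatever
    // sits between "{r" and the closing "}".
    private static readonly Regex TokenRegex =
        new Regex(@"\{r(?<code>.*?)?\}", RegexOptions.Multiline);

    // Returns the captured code of every token found in the text, in order.
    public static List<string> ExtractCodes(string text)
    {
        var codes = new List<string>();
        foreach (Match m in TokenRegex.Matches(text))
            codes.Add(m.Groups["code"].Value);
        return codes;
    }

    public static void Main()
    {
        // Hypothetical sample text; in a real template the tokens live inside Word runs.
        foreach (string code in ExtractCodes("Dear {rName}, your total is {rTotal}."))
            Console.WriteLine(code);
    }
}
```

Because the quantifier is lazy, each match stops at the first closing brace, so adjacent tokens on one line are captured separately.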

I’m currently testing this with one of our worst case scenario templates. I’m only running the code provided in the documentation for highlighting replacement text (https://docs.aspose.com/words/net/find-and-replace/). I haven’t looked at actually adding the replacement yet. It is taking consistently over 30 minutes to process a 200 page document that contains almost 20,000 tokens (Processed 19435 tokens in 38 minutes). I imagine that adding replacement code will only add to the execution time. We need to produce several hundred of these a day and only being able to do one or two an hour is not going to work.

Am I missing something?
Is there a better way to implement this functionality?
Would a DOCX format document be faster?
Would it be possible to parse the document by element and build a new one, inserting replacements where necessary? Would this be faster?

Thank you for looking into this.

adam.skelton · November 23, 2010, 5:13pm

Hi Barry,
Thanks for your inquiry.
Could you please attach your document here for testing? I am not quite sure how fast regular expressions are evaluated over the DOM, but they should not be taking that long. Have you tried running the replace with an empty handler (no code inside the Replacing method)? Does it still take just as long?
I would strongly suggest using mail merge instead. You can find details of this in the documentation here. There should be no issue with speed then. However, since you are using a large number of placeholders, I’m not sure whether you want to change them to merge fields. This could be done programmatically as well, though.
Thanks,

DTS_Barry · November 24, 2010, 2:26pm

Thank you for the response. Unfortunately, my company won’t let me release the document. I can try and create a sanitized, gibberish version for testing. I will post as soon as I have one available.

I have tested with an empty replacement evaluator and it takes roughly the same amount of time.

I’m not familiar with the mail merge process. Does the mail merge process allow for the insertion of rich content (HTML with tables, etc.)? What would be the process for converting our document’s replacement tokens to mail merge fields?

Thank you,

adam.skelton · November 24, 2010, 3:47pm

Hi Barry,
Thanks for this additional information.
Sure, I will keep my eye out for when you attach the document.
Yes, you can set up your own custom logic during mail merge. In the same way that you would insert content using a replace handler, you can use a mail merge handler to insert any type of content in place of a merge field. Please see the code here, which shows how to insert HTML content during mail merge.
Regarding the replacement of tokens with merge fields, the technique would involve a one-off processing pass over your document to replace them, using a technique similar to the one described here. This of course uses a replacement evaluator, so it may take a while, but it would only be a one-off.
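The linked sample is not reproduced in this thread, so the following is only a rough sketch of what a mail merge handler that inserts HTML might look like. It assumes a recent Aspose.Words for .NET release, where `IFieldMergingCallback` lives in `Aspose.Words.MailMerging` (older releases used `Aspose.Words.Reporting`). The `html_` field-name prefix, the field name `html_Body`, and the file names are hypothetical.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.MailMerging;

internal static class HtmlMergeSketch
{
    // Called by the mail merge engine once per merge field in the document.
    private sealed class InsertHtmlCallback : IFieldMergingCallback
    {
        void IFieldMergingCallback.FieldMerging(FieldMergingArgs e)
        {
            // Hypothetical convention: only fields named "html_..." carry HTML.
            if (!e.DocumentFieldName.StartsWith("html_") || e.FieldValue == null)
                return;

            // Move to the merge field and insert the value as formatted HTML.
            var builder = new DocumentBuilder(e.Document);
            builder.MoveToMergeField(e.DocumentFieldName);
            builder.InsertHtml((string)e.FieldValue);

            // The content is already inserted, so leave no plain text behind.
            e.Text = string.Empty;
        }

        void IFieldMergingCallback.ImageFieldMerging(ImageFieldMergingArgs e)
        {
            // Nothing to do for image fields in this sketch.
        }
    }

    private static void Main()
    {
        var doc = new Document("mailmergetemplate.doc"); // hypothetical template
        doc.MailMerge.FieldMergingCallback = new InsertHtmlCallback();
        doc.MailMerge.Execute(
            new[] { "html_Body" },
            new object[] { "<p><b>Rich</b> content from the database</p>" });
        doc.Save("output.pdf");
    }
}
```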
Thanks,

DTS_Barry · November 24, 2010, 6:36pm

Thanks again. I’ll review the documentation you mentioned. I’m attaching a garbled version of our document template. Let me know if there’s anything else I can do to help you resolve this issue.

Thank you,

alexey.noskov · November 25, 2010, 8:38am

Hi

Thank you for the additional information. Please note that searching is always quite an expensive operation, especially in large documents. I agree with Adam that using merge fields instead of placeholders is a better approach.
For example, your document has a table on page 6; I suppose you get the data to fill this table from your database. With mail merge, you can easily fill the table with data. You should use the Mail Merge with Regions feature to achieve this:
https://docs.aspose.com/words/java/types-of-mail-merge-operations/
With text placeholders, you would need to write your own code to repeat the table rows, but Mail Merge with Regions does this for you automatically.
Please let us know if you need more assistance, we will be glad to help you.
Best regards,

DTS_Barry · December 16, 2010, 5:01pm

Sorry, it’s been a while. I was pulled off to another project. I’m back to this now and I’ve tried the code you pointed at to replace our tokens with mail merge fields. I’m receiving the following error:

Unable to cast object of type ‘Aspose.Words.Fields.FieldSeparator’ to type ‘Aspose.Words.Run’.

The code snippet that throws the error:

foreach(Run run in runs)
   run.Remove();

It appears the first element in the ArrayList is a FieldSeparator.

Any suggestions?

alexey.noskov · December 17, 2010, 1:13am

Hi Barry,
Thanks for your request. Are you replacing your placeholders with merge fields programmatically? If so, please attach your input document and code here for testing. Also, I think it would be easier to change your template in MS Word and then use the Mail Merge technique as I suggested.
Best regards,

DTS_Barry · December 17, 2010, 1:56am

Yes, I’m using the code you pointed to earlier to replace the tokens programmatically. You can use the sample document I posted earlier in the thread for testing. I’m using the provided sample for ReplaceEvaluatorFindAndInsertMergefield pretty much as-is, except for adding one line to count the number of replacements.

Unfortunately, we have thousands of templates that are like this that would need to be converted to using mail merge fields and doing it manually isn’t an option, I’m afraid.

Here’s the full code:

using System;
using System.Collections;
using System.Diagnostics;
using System.Text.RegularExpressions;
using Aspose.Words;

namespace AsposePlayground
{
    internal class Program
    {
        public static readonly Regex ReplacementCodeRegex = new Regex(@"\{r(?<code>.*?)?\}", RegexOptions.Multiline);
        private static int _replacementCount;

        private static void Main(string[] args)
        {
            var license = new License();
            license.SetLicense("Aspose.Total.lic");

            var doc = new Document(@"sampletemplate.doc");

            Stopwatch sw = Stopwatch.StartNew();
            doc.Range.Replace(ReplacementCodeRegex, new ReplaceEvaluatorFindAndInsertMergefield(), true);
            sw.Stop();
            // TotalMinutes reports the whole elapsed time; Minutes is only the 0-59 component.
            Console.WriteLine(String.Format("Processed {0} fields in {1:F1} minutes.", _replacementCount, sw.Elapsed.TotalMinutes));
            doc.Save(@"mailmergetemplate.doc");
        }

        #region Nested type: ReplaceEvaluatorFindAndInsertMergefield

        private sealed class ReplaceEvaluatorFindAndInsertMergefield : IReplacingCallback
        {
            #region IReplacingCallback Members
            /// <summary>
            /// This method is called by the Aspose.Words find and replace engine for each match.
            /// This method replaces the matched token with a MERGEFIELD, even if the match spans multiple runs.
            /// </summary>
            ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
            {
                // This is a Run node that contains either the beginning or the complete match.
                Node currentNode = e.MatchNode;

                // The first (and may be the only) run can contain text before the match,
                // in this case it is necessary to split the run.
                if (e.MatchOffset > 0)
                    currentNode = SplitRun((Run)currentNode, e.MatchOffset);

                // This array is used to store all nodes of the match for further removing.
                var runs = new ArrayList();

                // Find all runs that contain parts of the match string.
                int remainingLength = e.Match.Value.Length;
                while ((remainingLength > 0) &&
                    (currentNode != null) &&
                    (currentNode.GetText().Length <= remainingLength))
                {
                    runs.Add(currentNode);
                    remainingLength = remainingLength - currentNode.GetText().Length;

                    // Select the next Run node.
                    // Have to loop because there could be other nodes such as BookmarkStart etc.
                    do
                    {
                        currentNode = currentNode.NextSibling;
                    } while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
                }

                // Split the last run that contains the match if there is any text left.
                if ((currentNode != null) && (remainingLength > 0))
                {
                    SplitRun((Run)currentNode, remainingLength);
                    runs.Add(currentNode);
                }

                // Create a DocumentBuilder and insert the MERGEFIELD.
                var builder = new DocumentBuilder(e.MatchNode.Document as Document);
                builder.MoveTo((Run)runs[runs.Count - 1]);
                string fieldName = e.Match.Groups["code"].Value;
                builder.InsertField(string.Format("MERGEFIELD {0}", fieldName), string.Format("«{0}»", fieldName));
                _replacementCount++;
                // Now remove all runs in the sequence.
                foreach (Run run in runs)
                    run.Remove();

                // Signal to the replace engine to do nothing because we have already done everything we wanted.
                return ReplaceAction.Skip;
            }

            #endregion

            /// <summary>
            /// Splits text of the specified run into two runs.
            /// Inserts the new run just after the specified run.
            /// </summary>
            private static Run SplitRun(Run run, int position)
            {
                var afterRun = (Run)run.Clone(true);
                afterRun.Text = run.Text.Substring(position);
                run.Text = run.Text.Substring(0, position);
                run.ParentNode.InsertAfter(afterRun, run);
                return afterRun;
            }
        }
        #endregion
    }
}

alexey.noskov · December 18, 2010, 10:51am

Hi
Thanks for your request. The code works perfectly with a small modification:

doc.Range.Replace(ReplacementCodeRegex, new ReplaceEvaluatorFindAndInsertMergefield(), false);

In this case, the find/replace engine starts processing the document from the end. Hope this helps.
However, I still think that it would be better to redesign your templates and use Mail Merge with Regions to fill tabular data in your documents.
Best regards.
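
The thread does not explain why processing from the end avoids the cast error. One common reason, offered here as an assumption and not as a description of Aspose.Words internals, is that an edit made at a later position cannot disturb positions that have not been visited yet. The sketch below shows the same idea on a plain string: matches are found once, then replaced from last to first so the earlier match offsets stay valid. The names `BackwardReplaceDemo` and `ReplaceFromEnd` and the `<<...>>` output format are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class BackwardReplaceDemo
{
    private static readonly Regex TokenRegex = new Regex(@"\{r(?<code>.*?)?\}");

    // Replaces every {r...} token with <<...>>, working from the last match
    // to the first so that offsets recorded for earlier matches remain correct.
    public static string ReplaceFromEnd(string text)
    {
        var sb = new StringBuilder(text);
        MatchCollection matches = TokenRegex.Matches(text);

        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            sb.Remove(m.Index, m.Length);
            sb.Insert(m.Index, "<<" + m.Groups["code"].Value + ">>");
        }

        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceFromEnd("{rFirst} and {rSecond}"));
    }
}
```

Walking the same match list front to back would require adjusting every later offset after each replacement, because the replacement text differs in length from the token.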