Find & Encapsulate Word Document Text Spanning across Single or Multiple Paragraphs within Content Control C# .NET

Kunal19 · November 11, 2019, 8:12am

How can I find text in a Word document and wrap it in a content control? The text may lie within a single paragraph or span multiple paragraphs.

Please find attached the input and expected documents: Files.zip (32.9 KB)

awais.hafeez · November 11, 2019, 9:41am

Thanks for your inquiry. Could you please attach your input Word document and expected document here for our reference? We will investigate the structure of your expected document to see how you want the final output to look. You can create the expected document by using Microsoft Word. We will then provide you with code to achieve the same using Aspose.Words.

Kunal19 · November 11, 2019, 10:18am

Thanks for your quick response. Please find attached input and expected document files.

awais.hafeez · November 12, 2019, 12:06am

@Kunal19,

You can find text in the document and wrap it in a content control by using the following code:

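using System.Collections;
using Aspose.Words;
using Aspose.Words.Markup;
using Aspose.Words.Replacing;

Document doc = new Document("E:\\Temp\\Files\\InputDocument.docx");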

FindReplaceOptions options = new FindReplaceOptions();
options.ReplacingCallback = new FindAndReplace();
options.Direction = FindReplaceDirection.Backward;

doc.Range.Replace("This Agreement is by and between XYZ Inc., a corporation under the laws of the state of Washington and having its principal place of business at 142711 Suite 100 Bellevue, WA (\"XYZ Inc.\") and Troy Inc.  (“Customer”), having a mailing address at 123 ABC Blvd, Building 2500, Dallas, USA 75001", "", options);
doc.Range.Replace("07/01/2015", "", options);
doc.Range.Replace("12/31/2026", "", options);
            
doc.Save("E:\\Temp\\Files\\19.11.docx");

private class FindAndReplace : IReplacingCallback
{
    /// <summary>
    /// Wraps the matched text in an inline content control.
    /// NOTE: This is a simplistic method that only works when the whole match
    /// lies within a single paragraph.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // Collects all runs of the match so they can be moved into the content control.
        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);
        builder.MoveTo((Run)runs[0]);

        StructuredDocumentTag sdt = new StructuredDocumentTag(builder.Document, SdtType.RichText, MarkupLevel.Inline);
        sdt.ChildNodes.Clear();
        builder.InsertNode(sdt);

        foreach (Run run in runs)
            sdt.AppendChild(run);

        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}

Hope this helps.
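To make the run bookkeeping in the callback easier to follow, here is a simplified model of the same steps that uses plain strings in place of Run nodes. It is only a sketch and not Aspose.Words code: RunSplitModel and CollectMatch are hypothetical names, and the model ignores non-Run siblings such as bookmarks.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone model: each string stands in for the text of one Run.
public static class RunSplitModel
{
    // Splits runs so that the match occupies whole runs, then returns those runs.
    // 'runs' is modified in place, just as the callback modifies the paragraph.
    public static List<string> CollectMatch(
        List<string> runs, int runIndex, int matchOffset, int matchLength)
    {
        // Split the first run if the match starts in the middle of it.
        if (matchOffset > 0)
        {
            string first = runs[runIndex];
            runs[runIndex] = first.Substring(0, matchOffset);
            runs.Insert(runIndex + 1, first.Substring(matchOffset));
            runIndex++;
        }

        var matched = new List<string>();
        int remaining = matchLength;

        // Take every run that lies entirely inside the match.
        while (remaining > 0 && runIndex < runs.Count && runs[runIndex].Length <= remaining)
        {
            matched.Add(runs[runIndex]);
            remaining -= runs[runIndex].Length;
            runIndex++;
        }

        // Split the last run if the match ends in the middle of it.
        if (remaining > 0 && runIndex < runs.Count)
        {
            string last = runs[runIndex];
            runs[runIndex] = last.Substring(0, remaining);
            runs.Insert(runIndex + 1, last.Substring(remaining));
            matched.Add(runs[runIndex]);
        }

        return matched;
    }
}
```

For example, with the runs "Effective ", "07/01" and "/2015 onwards", a 10-character match starting at offset 0 of the second run leaves four runs, of which "07/01" and "/2015" are the ones that would be moved into the content control.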

Kunal19 · November 12, 2019, 11:07am

Thanks @awais.hafeez,

It works perfectly for text within a single paragraph, but it doesn’t work for text that spans multiple paragraphs.
I want to tag two or more paragraphs together in one content control.
Could you please help me with this?

Regards,
Kunal

awais.hafeez · November 13, 2019, 2:21am

@Kunal19,

You can build logic on the following code that makes multiple Paragraphs part of a Content Control:

Document doc = new Document("E:\\Temp\\Files\\InputDocument.docx");

Paragraph targetPara = null;
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ToString(SaveFormat.Text).StartsWith("General Liability Insurance on an occurrence "))
    {
        targetPara = para;
        break;
    }
}

if (targetPara != null)
{
    StructuredDocumentTag sdt = new StructuredDocumentTag(doc, SdtType.RichText, MarkupLevel.Block);
    sdt.ChildNodes.Clear();
    targetPara.ParentNode.InsertBefore(sdt, targetPara);

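    // Move the target paragraph and the two nodes that follow it into the content control.
    // After each move, sdt.NextSibling is the next node still outside the control.
    sdt.AppendChild(targetPara);
    sdt.AppendChild(sdt.NextSibling);
    sdt.AppendChild(sdt.NextSibling);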
}

doc.Save("E:\\Temp\\Files\\19.11.docx");

Hope this helps.
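The snippet above moves a fixed number of nodes and would throw if fewer siblings follow the target paragraph. A minimal sketch of a more defensive variant is shown below; SdtHelper and WrapParagraphs are hypothetical names, and the decision to stop at the first non-paragraph sibling is an assumption about the desired behavior, not something prescribed by Aspose.Words.

```csharp
using Aspose.Words;
using Aspose.Words.Markup;

public static class SdtHelper
{
    // Hypothetical helper: wraps 'first' and up to (count - 1) paragraphs that
    // directly follow it in one block-level rich-text content control.
    public static StructuredDocumentTag WrapParagraphs(Paragraph first, int count)
    {
        StructuredDocumentTag sdt =
            new StructuredDocumentTag(first.Document, SdtType.RichText, MarkupLevel.Block);
        sdt.RemoveAllChildren(); // start from an empty control, as in the snippet above
        first.ParentNode.InsertBefore(sdt, first);

        sdt.AppendChild(first);
        for (int i = 1; i < count; i++)
        {
            // After each move, the next candidate is the node right after the control.
            Node next = sdt.NextSibling;
            if (next == null || next.NodeType != NodeType.Paragraph)
                break; // assumption: stop at the end of the story or at a table
            sdt.AppendChild(next);
        }

        return sdt;
    }
}
```

With the paragraph found by the loop above, the three AppendChild calls could then be replaced by SdtHelper.WrapParagraphs(targetPara, 3).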

Kunal19 · December 2, 2019, 7:55am

Thanks @awais.hafeez.
Based on your answer, I have built logic to tag multi-paragraph text.

Another problem I am facing now is that the code below affects every occurrence of the given string ("07/01/2015"):
doc.Range.Replace("07/01/2015", "", options);

I want it applied only when the string appears in a specific paragraph or position.

Could you please help me with this problem?

Regards,
Kunal

awais.hafeez · December 2, 2019, 1:26pm

@Kunal19,

Thanks for your inquiry. Please ZIP and upload your simplified input Word document and your expected DOCX file showing the desired output here for testing. You can create the expected document by using MS Word. We will then investigate the scenario on our end and provide you with more information.

Kunal19 · January 10, 2020, 10:35am

@awais.hafeez,

Please find attached the ZIP file with the simplified input and expected DOCX files: Files.zip (27.5 KB)

awais.hafeez · January 10, 2020, 3:00pm

@Kunal19,

The following code will produce an output similar to the “ExpectedDocument.docx” document you shared:

Document doc = new Document("E:\\Temp\\Files\\InputDocument.docx");

FindReplaceOptions options = new FindReplaceOptions();
options.ReplacingCallback = new FindAndReplace();
options.Direction = FindReplaceDirection.Forward;

doc.Range.Replace("07/01/2015", "", options);

doc.Save("E:\\Temp\\Files\\20.1.docx");

private class FindAndReplace : IReplacingCallback
{
    /// <summary>
    /// Wraps the matched text in an inline content control.
    /// NOTE: This is a simplistic method that only works when the whole match
    /// lies within a single paragraph.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // Collects all runs of the match so they can be moved into the content control.
        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);
        builder.MoveTo((Run)runs[0]);

        StructuredDocumentTag sdt = new StructuredDocumentTag(builder.Document, SdtType.RichText, MarkupLevel.Inline);
        sdt.ChildNodes.Clear();
        builder.InsertNode(sdt);

        foreach (Run run in runs)
            sdt.AppendChild(run);

        return ReplaceAction.Stop;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}

Kunal19 · January 12, 2020, 6:20am

Hi @awais.hafeez,

The above code will tag all occurrences of the given string. I want to tag only one specific occurrence (perhaps based on its context).

Please find attached the expected document.
The text "07/01/2015" appears in two places in the document, but I want to tag it in only one place. Files.zip (27.5 KB)

awais.hafeez · January 13, 2020, 6:04am

@Kunal19,

You will get the expected output if you set the direction to forward (options.Direction = FindReplaceDirection.Forward;) in the main code and change the last line of IReplacingCallback.Replacing from return ReplaceAction.Skip; to return ReplaceAction.Stop; so that processing ends after the first match. The output produced on our end is attached here for your reference:

20.1.zip (13.8 KB)

For the complete code, please see my previous post. Hope this helps.

Kunal19 · February 3, 2020, 9:43am

@awais.hafeez,

return ReplaceAction.Stop; will tag only the first occurrence of the string in the search direction.

Consider an example where I want to tag one particular occurrence of a string, which could be the first, the last, or any occurrence in between.
I don’t know its exact index.
Is there any way to identify it based on the text before and after it (prefix and suffix), or by any other technique?

Regards,
Kunal

awais.hafeez · February 4, 2020, 3:57am

@Kunal19,

You can look for a particular string (a prefix, a suffix, etc.) in the Paragraphs and then run the replace routine only on the Paragraph in which that string is found. For example:

Document doc = new Document("E:\\Temp\\Files\\InputDocument.docx");

Paragraph targetPara = null;
foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
{
    if (para.ToString(SaveFormat.Text).Contains("some text prefix maybe"))
    {
        targetPara = para;
        break;
    }
}

if (targetPara != null)
{
    FindReplaceOptions options = new FindReplaceOptions();
    options.ReplacingCallback = new FindAndReplace();
    options.Direction = FindReplaceDirection.Forward;

    targetPara.Range.Replace("07/01/2015", "", options);
}

doc.Save("E:\\Temp\\Files\\20.1.docx");
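If the same string occurs more than once inside the target paragraph, the paragraph filter alone is not enough. One option is to put the context into the pattern itself with a lookbehind and a lookahead, so that only the string itself is matched. The sketch below uses plain .NET Regex on ordinary strings; ContextMatch and Build are hypothetical names, and whether Range.Replace with a Regex treats lookarounds the same way on document text is an assumption you would need to verify.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContextMatch
{
    // Builds a pattern that matches 'target' only when it directly follows
    // 'prefix' and directly precedes 'suffix'. Either context may be empty.
    public static Regex Build(string target, string prefix, string suffix)
    {
        string pattern = Regex.Escape(target);

        // Lookbehind: the prefix must be present but is not part of the match.
        if (!string.IsNullOrEmpty(prefix))
            pattern = "(?<=" + Regex.Escape(prefix) + ")" + pattern;

        // Lookahead: the suffix must be present but is not part of the match.
        if (!string.IsNullOrEmpty(suffix))
            pattern = pattern + "(?=" + Regex.Escape(suffix) + ")";

        return new Regex(pattern);
    }
}
```

For instance, ContextMatch.Build("07/01/2015", "Effective from ", "") matches only the date that immediately follows "Effective from " and skips any other occurrence of the same date in the text.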

Kunal19 · February 6, 2020, 11:30am

Thanks @awais.hafeez.

Kunal19 · May 7, 2020, 7:45am

Hi @awais.hafeez,

I am facing another problem: I am not able to add a content control to a paragraph if the paragraph ends with text that is already tagged (inside a content control).

I am using the following code.

var searchPattern = "This AGREEMENT.*expire on 10/02/2019.";
doc.Range.Replace(new Regex(searchPattern, RegexOptions.Multiline), string.Empty, options);

It works if I remove the content control from the last word.

I am attaching the input and expected documents.

Could you please help me with this?

Regards,
Kunal

Files.zip (30.9 KB)

awais.hafeez · May 7, 2020, 2:03pm

@Kunal19,

You are right; even the following code currently does not work:

Document doc = new Document(@"E:\Temp\Files (6)\InputDoc.docx");
var searchPattern = "This AGREEMENT.*expire on 10/02/2019.";
doc.Range.Replace(new Regex(searchPattern, RegexOptions.Multiline), "New Text");
doc.Save("E:\\Temp\\Files (6)\\20.5.docx");

We have logged this problem in our issue tracking system for further investigation. The ID of this issue is WORDSNET-20397. We will look further into the details of this problem and keep you updated on its status. We apologize for the inconvenience.
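Independently of the logged issue, two details of the pattern itself are worth checking in plain .NET: RegexOptions.Multiline only changes how ^ and $ behave, and an unescaped "." matches any character rather than a literal period. The sketch below shows one way to build such a pattern from literal start and end phrases. It works on ordinary strings only; PatternNotes and Between are hypothetical names, and nothing here is a claim about how Aspose.Words exposes document text to the regex.

```csharp
using System.Text.RegularExpressions;

public static class PatternNotes
{
    // Escapes the literal start and end phrases and allows any text in between.
    // RegexOptions.Singleline lets '.' also match '\n', which Multiline does not do.
    public static Regex Between(string start, string end)
    {
        return new Regex(
            Regex.Escape(start) + ".*?" + Regex.Escape(end),
            RegexOptions.Singleline);
    }
}
```

With this helper, PatternNotes.Between("This AGREEMENT", "expire on 10/02/2019.") requires a real period at the end, whereas the unescaped pattern would also accept any other character in that position.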

Kunal19 · May 8, 2020, 5:36am

Thanks @awais.hafeez for quick response.

Will wait to hear from you on this issue.

Regards,
Kunal

awais.hafeez · May 8, 2020, 12:12pm

@Kunal19,

Sure, we will inform you via this thread as soon as this issue is resolved.

awais.hafeez · May 21, 2020, 3:01pm

@Kunal19,

This is to update you that we have completed the analysis of WORDSNET-20397. The details are as follows:

The desired behavior cannot be achieved by using Aspose.Words. When Aspose.Words searches for a matching pattern, it does not include the content of a content control (StructuredDocumentTag, SDT) in the searchable text.

MS Word appears to behave the same way. If you try this scenario in the MS Word GUI, you will see that MS Word does not replace anything either: it does not find a match when the pattern spans both normal text and text inside a StructuredDocumentTag.

So, Aspose.Words currently mimics MS Word’s behavior. If we can help you with anything else, please feel free to ask.