Range.Replace does not replace the numbers using C#

zpopswat · March 27, 2020, 11:47am

Hi There,

I am trying to use Aspose.Words to detect and replace strings in Word documents using regular expressions. Unfortunately, it does not find any matches.

Here is the example code

using System;
using Aspose.Words;
using System.Text.RegularExpressions;

namespace WordRegexWithAspose
{
class Program
{
    static void Main(string[] args)
    {
        System.Console.WriteLine("START");

        //initialize a valid license here

        string path = @"c:\Users\Administrator\source\repos\WordRegexWithAspose\WordRegexWithAspose\regexTest.docx";
        Document wordsDocument = new Document(path);

        wordsDocument.Range.Replace(new Regex(@"^[1-9]{1}[0-9]{10}$"), "REPLACED");
        wordsDocument.Save(@"c:\Users\Administrator\source\repos\WordRegexWithAspose\WordRegexWithAspose\regexTest2.docx");

        System.Console.WriteLine("END");
    }
}
}

This looks simple and straightforward, but unfortunately it does not work.
Can you please advise what the problem might be?

Thanks,
Zoltan

Please find my test file attached:
regexTest.zip (9.7 KB)

zpopswat · March 27, 2020, 12:50pm

Hi There again,

I have carried out a little experiment to discover the internal structure of the docx file attached above, from the Aspose.Words API's point of view.

Here is the code:

using System;
using Aspose.Words;

namespace WordRegexWithAspose
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Console.WriteLine("START");

            //initialize a valid license here

            string path = @"c:\Users\Administrator\source\repos\WordRegexWithAspose\WordRegexWithAspose\regexTest.docx";
            Document wordsDocument = new Document(path);

            walkNodes(wordsDocument, "");

            System.Console.WriteLine("END");
        }

        // Recursively prints each node's type and text, indenting one tab per tree level.
        static void walkNodes(Node n, string indent)
        {
            string originaltext = n.GetText();
            string changedText = changeControlChars(originaltext);
            System.Console.WriteLine("{0}{1} \t \"{2}\"", indent, n.GetType().ToString(), changedText);

            if (n.IsComposite)
            {
                CompositeNode cn = (CompositeNode)n;
                // Direct children only; the recursion handles deeper levels.
                NodeCollection nodes = cn.GetChildNodes(NodeType.Any, false);
                string cindent = indent + "\t";
                foreach (Node child in nodes)
                {
                    walkNodes(child, cindent);
                }
            }
        }

        // Replaces each control character with "x" followed by its hex code, e.g. xD for carriage return.
        static string changeControlChars(string input)
        {
            string output = "";

            foreach (char c in input)
            {
                if (Char.IsControl(c))
                {
                    int ci = c;
                    //System.Console.WriteLine("Control character found {0:X}", ci);
                    output += String.Format("x{0}", ci.ToString("X"));
                }
                else
                {
                    output += c;
                }
            }
            return output;
        }
    }
}

The basic idea here is to walk through the hierarchy of nodes in the document and see the document text at any given node level.

Note: non-visible control characters are replaced with their hex values, e.g. carriage return becomes xD, page break becomes xC, etc.

The above code produces the following output:

START
Aspose.Words.Document 	 "12345678901xD12345678902xD12345678903xD12345678904xD12345678905xC"
	Aspose.Words.Section 	 "12345678901xD12345678902xD12345678903xD12345678904xD12345678905xC"
		Aspose.Words.Body 	 "12345678901xD12345678902xD12345678903xD12345678904xD12345678905xC"
			Aspose.Words.Paragraph 	 "12345678901xD"
				Aspose.Words.Run 	 "12345678901"
				Aspose.Words.BookmarkStart 	 ""
				Aspose.Words.BookmarkEnd 	 ""
			Aspose.Words.Paragraph 	 "12345678902xD"
				Aspose.Words.Run 	 "12345678902"
			Aspose.Words.Paragraph 	 "12345678903xD"
				Aspose.Words.Run 	 "12345678903"
			Aspose.Words.Paragraph 	 "12345678904xD"
				Aspose.Words.Run 	 "12345678904"
			Aspose.Words.Paragraph 	 "12345678905xC"
				Aspose.Words.Run 	 "12345678905"
END

Apparently, at the very bottom of this tree are the leaf nodes, e.g. Runs, which, to my understanding, contain the actual text “segments”, so to speak.
One level up are the Paragraphs, which contain the text from the Runs plus some control characters.
The higher we go, the more lower-level nodes are combined together.

Clearly, the above regular expression should match the text of any of the Runs, so the following questions arise:

At which node level does Range.Replace operate: Section, Body, Paragraph or Run?
Why does Range.Replace not match the regex against the text at any level?
Does it get confused by the control characters at the Paragraph level or above?
Am I misusing Range.Replace? Is this not the intended purpose of the function?
Does it not, by any chance, support full-line matching, e.g. ^ … $ ? In my experience, if the leading ^ and the trailing $ are removed from the regex, the matching works just fine. However, “^[1-9]{1}[0-9]{10}$” and “[1-9]{1}[0-9]{10}” are not the same from a regex point of view; I think we can agree on that.
How can I manually implement regex find-and-replace logic if Range.Replace does not work?

Thanks,
Zoltan

tahir.manzoor · March 27, 2020, 5:27pm

@zpopswat

In your case, you are not using the correct regex for your requirement. You can test your regex against the shared text in an online regex tester. Could you please ZIP and attach your input and expected output documents, along with some more detail about your test case and requirement? We will then provide you with more information about your query.

zpopswat · March 30, 2020, 8:53am

@tahir.manzoor

Please find the input attached to my initial post ( regexTest.zip (9.7 KB))

The use case is the following:
I have multiple numbers in my input docx file eg:

12345678901
12345678902
12345678903
12345678904
12345678905

I want these numbers detected and replaced with the given replacement string e.g: “REPLACED”

I have tested the regex online as you suggested. I copied and pasted the list of numbers from above, and the “^[1-9]{1}[0-9]{10}$” regex did not produce any matches unless the multiline option was switched on.
With multiline on, however, the regex found all the numbers.

So, based on this experience, I went ahead and amended my C# test code as follows:

wordsDocument.Range.Replace(new Regex(@"^[1-9]{1}[0-9]{10}$", RegexOptions.Multiline), "REPLACED");

But still no luck even with this change.
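One possible contributing factor, offered as an illustration rather than a confirmed diagnosis: the node dump above shows the document text separated by xD (carriage return) and ended by xC, while the .NET regex engine treats only the line feed (\n) as a line boundary for ^ and $, even with RegexOptions.Multiline. The sketch below uses plain System.Text.RegularExpressions with no Aspose.Words calls. The class name AnchorDemo, the helper CountMatches and the sample strings are hypothetical, and whether Range.Replace matches against text in exactly this form is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Counts how many times the pattern matches the given text.
    public static int CountMatches(string text, string pattern, RegexOptions options)
    {
        return new Regex(pattern, options).Matches(text).Count;
    }

    public static void Main()
    {
        const string anchored = @"^[1-9]{1}[0-9]{10}$";
        const string unanchored = @"[1-9]{1}[0-9]{10}";

        // Assumption: imitates the Document-level text from the node dump (xD = \r, xC = \f).
        string carriageReturns = "12345678901\r12345678902\r12345678903\f";
        // The same numbers separated by line feeds, as pasted text often is.
        string lineFeeds = "12345678901\n12345678902\n12345678903";

        // \r is not a line boundary for ^ and $, so the anchored pattern never matches.
        Console.WriteLine(CountMatches(carriageReturns, anchored, RegexOptions.None));      // 0
        Console.WriteLine(CountMatches(carriageReturns, anchored, RegexOptions.Multiline)); // 0

        // Without anchors, every number is found.
        Console.WriteLine(CountMatches(carriageReturns, unanchored, RegexOptions.None));    // 3

        // With \n separators, Multiline makes the anchored pattern match each line.
        Console.WriteLine(CountMatches(lineFeeds, anchored, RegexOptions.None));            // 0
        Console.WriteLine(CountMatches(lineFeeds, anchored, RegexOptions.Multiline));       // 3
    }
}
```

This would be consistent with the online tester finding all numbers with multiline on while the same pattern finds nothing in the document text, but it does not show what Range.Replace does internally.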

tahir.manzoor · March 30, 2020, 4:43pm

@zpopswat

We have tested the scenario and have managed to reproduce the same issue at our side. For the sake of correction, we have logged this problem in our issue tracking system as WORDSNET-20212. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

zpopswat · March 31, 2020, 8:36am

@tahir.manzoor

Thank you for confirming this.
Can you please put a timeline to the fix?
We have a solution that depends on this fix, so we would appreciate a quick resolution.

tahir.manzoor · March 31, 2020, 3:25pm

@zpopswat

We try our best to deal with every customer request in a timely fashion, but unfortunately we cannot guarantee a delivery date for every customer issue. We work on issues on a first come, first served basis. We feel this is the fairest and most appropriate way to satisfy the needs of the majority of our customers.

Currently, your issue is pending analysis and is in the queue. Once we complete the analysis of your issue, we will be able to provide you with an estimate.

aspose.notifier · May 12, 2020, 7:32am

The issues you have found earlier (filed as WORDSNET-20212) have been fixed in Aspose.Words for .NET 20.5 and Aspose.Words for Java 20.5.

zpopswat · June 21, 2021, 9:22am

Team,

I tried this fix with the latest Words (21.6.0) and I can confirm that the fix does not work, therefore I’d like to reopen this issue.

Please find an example solution attached that reproduces the issue.

RegexTestWithWord_WordsNet_20212.zip (113.9 KB)

tahir.manzoor · June 21, 2021, 3:41pm

@zpopswat

We suggest you please read the following article.
Find and Replace String using Metacharacters

Please use the following line of code in your application to get the desired output.

wordsDocument.Range.Replace(new Regex(@"^[1-9]{1}[0-9]{10}[&p]$", RegexOptions.Multiline), "REPLACED&p");

zpopswat · June 22, 2021, 10:30am

@tahir.manzoor thank you. It works with the metacharacter thing added to the regex.

Can you please refer me to the spec of all metacharacters available for use with Aspose.Words?
The document you linked above only mentions four: &p, &l, &b, &m, but I reckon there are more.

Also, do you reckon that metacharacters can be located at the end of the line? I mean, does this make sense at all,
e.g.:

wordsDocument.Range.Replace(new Regex(@"^[&l][1-9]{1}[0-9]{10}[&p]$", RegexOptions.Multiline), "&lREPLACED&p");

tahir.manzoor · June 22, 2021, 1:14pm

@zpopswat

Please check the details of the Range.Replace method. This method uses the following special metacharacters.

&p - paragraph break
&b - section break
&m - page break
&l - manual line break

The &p is used for paragraph break. The paragraph break character comes at the end of paragraph. Please check the attached image for detail. paragraph break.png (20.9 KB)

Please check the code examples shared in the following article.

zpopswat · June 24, 2021, 10:24am

@tahir.manzoor: thank you for your help. It is very much appreciated!

zpopswat · October 4, 2021, 12:42pm

@tahir.manzoor: taking your advice, I updated my regex to take metacharacters into consideration when replacing text. So the updated regex looks as follows:

wordsDocument.Range.Replace(new Regex(@"^[1-9]{1}[0-9]{10}(?<line_end_metachars>[&p|&m|&b|&l]*)$", RegexOptions.Multiline), "REPLACED", options);

The idea is still to detect full-line matches that contain the number expressions described by the regular expression. This time, however, I take metacharacters into consideration and collect them in a named group. The named group is then processed in an IReplacingCallback implementation that is passed into the Range.Replace function.
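The capture-and-restore idea can be sketched without Aspose.Words by using plain Regex.Replace with a match evaluator as a stand-in for IReplacingCallback. This is only an analogue under stated assumptions: the text is taken to use \r for paragraph breaks, \v for manual line breaks and \f for page or section breaks, and the names LineEndDemo and ReplaceKeepingBreaks are hypothetical. Note that the break characters are written as a character class of single characters, since square brackets in a .NET regex denote a set of characters rather than an alternation of multi-character tokens.

```csharp
using System;
using System.Text.RegularExpressions;

static class LineEndDemo
{
    // Assumption: \r = paragraph break, \v = manual line break, \f = page/section break.
    // The lookbehind requires the number to start a line; the named group captures
    // the break characters (or the end of the text) that terminate the line.
    private static readonly Regex FullLineNumber = new Regex(
        @"(?<=^|[\r\f\v])[1-9][0-9]{10}(?<line_end>[\r\f\v]+|$)");

    // Replaces each full-line 11-digit number and re-appends the captured break characters.
    public static string ReplaceKeepingBreaks(string text)
    {
        return FullLineNumber.Replace(text, m => "REPLACED" + m.Groups["line_end"].Value);
    }

    public static void Main()
    {
        string text = "12345678901\r72345678901\v72345678903\f";
        string result = ReplaceKeepingBreaks(text);

        // Every number is replaced while its terminating break character is preserved.
        Console.WriteLine(result == "REPLACED\rREPLACED\vREPLACED\f"); // True

        // A 12-digit line is left alone because the whole line must match.
        Console.WriteLine(ReplaceKeepingBreaks("123456789012\r") == "123456789012\r"); // True
    }
}
```

In an Aspose.Words callback the equivalent step would be to read the named group from the match and append it to the replacement; how Range.Replace translates &p, &l, &b and &m inside a regex pattern is not shown here.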

For details please take a look into the attached solution and the example docx file in it.
WordRangeReplaceWithMetacharacters.zip (202.5 KB)

My problem is that, in my experience, detecting metacharacters partly works and partly doesn’t.

More specifically the above described regex only finds lines that end with paragraph breaks eg: 12345678901, 12345678902, 12345678903, 12345678904, 12345678905

However it is not able to find lines that are terminated with

line break: eg. 72345678901, 72345678902
section break: eg. 72345678903
page break: eg: 62345678902

Can you please take a look into this solution and see if there is something I’m doing wrong or this metacharacter detection indeed does not work as expected.

Thank you,
Zoltan

tahir.manzoor · October 4, 2021, 3:33pm

@zpopswat

We have tested the scenario and managed to reproduce the same issue at our side. For the sake of correction, we have logged this problem in our issue tracking system as WORDSNET-22814. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

aspose.notifier · November 5, 2021, 9:13am

The issues you have found earlier (filed as WORDSNET-22814) have been fixed in Aspose.Words for .NET 21.11, which is also available on NuGet.