Hi there again,
I have carried out a little experiment to discover the internal structure of the docx file attached above, from the Aspose.Words API point of view.
Here is the code:
using System;
using System.Text;
using Aspose.Words;

namespace WordRegexWithAspose
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("START");
            //initialize a valid license here
            string path = @"c:\Users\Administrator\source\repos\WordRegexWithAspose\WordRegexWithAspose\regexTest.docx";
            Document wordsDocument = new Document(path);
            // Start the recursive walk at the document root.
            walkNodes(wordsDocument, "");
            Console.WriteLine("END");
        }

        // Prints the type and text of a node, then recurses into its children
        // with one more level of indentation.
        static void walkNodes(Node n, string indent)
        {
            string originalText = n.GetText();
            string changedText = changeControlChars(originalText);
            Console.WriteLine("{0}{1} \t \"{2}\"", indent, n.GetType().ToString(), changedText);
            if (n.IsComposite)
            {
                CompositeNode cn = (CompositeNode)n;
                // false: direct children only; deeper levels are reached by the recursion.
                NodeCollection nodes = cn.GetChildNodes(NodeType.Any, false);
                string cindent = indent + "\t";
                foreach (Node child in nodes)
                {
                    walkNodes(child, cindent);
                }
            }
        }

        // Replaces every control character with "x" followed by its hex code,
        // so that carriage returns, form feeds, etc. become visible.
        static string changeControlChars(string input)
        {
            StringBuilder output = new StringBuilder();
            foreach (char c in input)
            {
                if (Char.IsControl(c))
                {
                    int ci = c;
                    //Console.WriteLine("Control character found {0:X}", ci);
                    output.Append(String.Format("x{0}", ci.ToString("X")));
                }
                else
                {
                    output.Append(c);
                }
            }
            return output.ToString();
        }
    }
}
The basic idea here is to walk through the hierarchy of nodes in the document and see the document text at any given node level.
Note: non-visible control characters are replaced with their hex value, e.g. carriage return becomes xD, form feed (new page) becomes xC, etc.
The above code produces the following output:
START
Aspose.Words.Document "12345678901xD12345678902xD12345678903xD12345678904xD12345678905xC"
    Aspose.Words.Section "12345678901xD12345678902xD12345678903xD12345678904xD12345678905xC"
        Aspose.Words.Body "12345678901xD12345678902xD12345678903xD12345678904xD12345678905xC"
            Aspose.Words.Paragraph "12345678901xD"
                Aspose.Words.Run "12345678901"
                Aspose.Words.BookmarkStart ""
                Aspose.Words.BookmarkEnd ""
            Aspose.Words.Paragraph "12345678902xD"
                Aspose.Words.Run "12345678902"
            Aspose.Words.Paragraph "12345678903xD"
                Aspose.Words.Run "12345678903"
            Aspose.Words.Paragraph "12345678904xD"
                Aspose.Words.Run "12345678904"
            Aspose.Words.Paragraph "12345678905xC"
                Aspose.Words.Run "12345678905"
END
Apparently, at the very bottom of this tree there are the leaf nodes, e.g. Runs, which, to my understanding, contain the actual text "segments", so to speak.
One level up there are the Paragraphs, which contain the text of their Runs plus a trailing control character.
And the higher we go, the more lower-level nodes are combined into a single text.
Clearly, the above regular expression should match the text of any of the Runs, so the following questions arise:
- At which node level does Range.Replace operate? Section, Body, Paragraph or Run?
- Why does Range.Replace not match the regex against the text at any level?
- Does it get confused by the control characters at Paragraph level or above?
- Am I misusing Range.Replace? Is this not the intended purpose of the function?
- Does it, by any chance, not support full-line matching, e.g. ^ … $ ? In my experience, if the leading ^ and the trailing $ are removed from the regex, the matching works just fine. However, "^[1-9]{1}[0-9]{10}$" and "[1-9]{1}[0-9]{10}" are not the same from a regex point of view; I think we can agree on that.
- How can I manually implement regex find-and-replace logic if Range.Replace does not work?
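To make the questions about control characters and ^ … $ concrete, here is a small sketch that uses only System.Text.RegularExpressions and no Aspose calls. It assumes the two strings seen in the output above: the Run-level text without the trailing carriage return and the Paragraph-level text with it. The AnchorCheck class and the MatchesIgnoringTrailingControls helper are hypothetical names of my own. Whether Range.Replace sees the text the same way plain Regex does is exactly what I do not know; this only shows how .NET itself treats the anchors.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorCheck
{
    // Anchored and unanchored versions of the pattern from the question.
    public const string Anchored = @"^[1-9]{1}[0-9]{10}$";
    public const string Unanchored = @"[1-9]{1}[0-9]{10}";

    // Hypothetical helper: drops trailing control characters (such as \r or \f)
    // and then applies the anchored pattern to what is left.
    public static bool MatchesIgnoringTrailingControls(string text)
    {
        int end = text.Length;
        while (end > 0 && char.IsControl(text[end - 1]))
        {
            end--;
        }
        return Regex.IsMatch(text.Substring(0, end), Anchored);
    }

    public static void Main()
    {
        string runText = "12345678901";         // text as printed at Run level
        string paragraphText = "12345678901\r"; // text as printed at Paragraph level

        Console.WriteLine(Regex.IsMatch(runText, Anchored));               // True
        Console.WriteLine(Regex.IsMatch(paragraphText, Anchored));         // False: $ does not match before \r
        Console.WriteLine(Regex.IsMatch(paragraphText, Unanchored));       // True
        Console.WriteLine(MatchesIgnoringTrailingControls(paragraphText)); // True
    }
}
```

If Range.Replace matches against text that still carries the trailing xD or xC, this would explain why the anchored pattern fails while the unanchored one works, but that is only my guess.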
Thanks,
Zoltan