Hi Team,
Let me start with some background so that you can better understand the problem I have run into, and please excuse me if this post is longer than usual.
In our company, one of our main products relies on a proposal document generation process: we have a Word document with custom tags and formulas that we know how to interpret.
The first phase is to upload a document, process all custom tags, and convert their matches into SDTs (structured document tags). Once that phase is done, the document becomes our template, which we can use to generate Word proposals for our customers (by searching for SDTs in the template and replacing them with real data).
Previously we used Interop services to manipulate Word and create Open XML controls; we have now switched to Aspose.Words.
Now for my questions (the issues I have run into):
- How can I convert a regex match into an SDT while keeping all formatting properties and the text structure inside it?
- How can I identify all the Runs that hold the text found by a regex match?
- How can I choose between a block-level SDT and an inline SDT at runtime (I don't know in advance whether my custom tag sits inside running text or spans whole paragraphs)?
I am attaching two documents: the first (input.docx) contains custom tags (a simplified example), and the second (output.docx) is the desired output, i.e. the same file converted into a document with SDTs. As you can see in these documents, text formatting should be preserved after processing.
I don't know if this is the best approach to converting a file like this. Should I replace all the text inside the ReplaceEvaluator callback when a regex match is found?
My current code logic is as follows:
Find the match:
// Simplified example of a pattern:
// var regexPattern = @"<<MAIN_TAG.*?MAIN_TAG_END>>";
// In the real code the pattern comes from tagInfo.SearchRegex:
Regex regex = new Regex(tagInfo.SearchRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
document.Range.Replace(regex, new MyReplaceEvaluator(tagInfo), false);
Replace evaluator callback:
Node currentNode = args.MatchNode;
Document doc = (Document)currentNode.Document;
foreach (TagInfo tagInfo in TagInfo.ChildTags)
{
ConvertTag(doc, tagInfo);
}
string currentMatchText = args.Match.Value;
// create a new SDT based on the regex match
// just remove custom TAGS used in regex
StructuredDocumentTag sdt = CreateNewSdt(doc, (Run)currentNode, currentMatchText);
// find all runs that this match belongs to
List<Run> runs = FindAndSplitMatchRuns(currentNode, currentMatchText, args.MatchOffset);
// find all parent nodes of those runs (those will be paragraphs, I suppose)
List<Node> nodesToRemove = FindAllNodesToRemove(runs);
// insert newly created SDT after the last parent node that we have found
nodesToRemove.Last().ParentNode.InsertAfter(sdt, (CompositeNode)nodesToRemove.Last());
RemoveNodes(nodesToRemove);
// Signal to the replace engine to do nothing, because we have already done everything we wanted.
return ReplaceAction.Skip;
Don't worry about the TagInfo class; it is a helper that keeps track of my child tags, the SDT title, what to strip from the beginning or the end of the regex match, and so on.
Here is where I have a problem: the FindAndSplitMatchRuns method. I found the example somewhere in your Aspose code examples, but that code does not work for me, because it assumes that all the Runs you have found and want to delete are inside a single parent Paragraph. In my case that is not true: a match can span several paragraphs.
How can I find all the Runs (keeping their formatting) and convert all the text inside the match I have found into an SDT, together with all of its child SDTs?
Once again, apologies for the long post.
Thanks in advance.