Converting text into SDT (finding text based on the custom TAGS with regex)

NikName · March 8, 2016, 5:41am

Hi Team,
Let me start with some background so that the problem is easier to understand, and excuse me if this post is longer than usual.
One of our company's main products relies on a proposal document generation process: we have a Word document with custom tags and formulas that we know how to interpret.
The first phase is to upload a document, process all custom tags, and convert their matches into SDTs (structured document tags). Once that phase is done, the document becomes our template, which we use to produce Word proposals for our customers (by searching for the SDTs in the template and replacing them with real data).
Previously we used the Interop services to manipulate Word and create Open XML content controls; we have now switched to Aspose.Words.
OK, now please let me ask my questions (issues that I have bumped into):

How do I convert a regex match into an SDT while keeping all formatting properties and the text structure inside it?
How do I identify all Runs that hold the text found by a regex match?
How do I choose between a block SDT and an inline SDT at runtime (I don’t know in advance whether a custom tag sits inside a paragraph's text or spans whole paragraphs)?

I am uploading two documents: the first (input.docx) contains custom tags (a simplified example), and the second (output.docx) is the desired output, with the tags converted into SDTs. As you can see in these documents, text formatting should be preserved after processing.
I don’t know if this is the best approach to convert a file like this. Should I replace all the text inside the ReplaceEvaluator callback when a regex match is found?
My current code logic is this:
find match

// Example of the pattern held in tagInfo.SearchRegex:
// @"<<MAIN_TAG.*?MAIN_TAG_END>>"
Regex regex = new Regex(tagInfo.SearchRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
document.Range.Replace(regex, new MyReplaceEvaluator(tagInfo), false);

replace evaluator callback

Node currentNode = args.MatchNode;
Document doc = (Document)currentNode.Document;
foreach (TagInfo tagInfo in TagInfo.ChildTags)
{
    ConvertTag(doc, tagInfo);
}

string currentMatchText = args.Match.Value;
// create a new SDT based on the regex match
// just remove custom TAGS used in regex
StructuredDocumentTag sdt = CreateNewSdt(doc, (Run)currentNode, currentMatchText);
// find all runs that this match belongs to
List<Run> runs = FindAndSplitMatchRuns(currentNode, currentMatchText, args.MatchOffset);
// find all parent nodes of those runs (those will be paragraphs, I suppose)
List<Node> nodesToRemove = FindAllNodesToRemove(runs);
// insert newly created SDT after the last parent node that we have found
nodesToRemove.Last().ParentNode.InsertAfter(sdt, (CompositeNode)nodesToRemove.Last());
RemoveNodes(nodesToRemove);
// Signal to the replace engine to do nothing because we have already done all what we wanted.

return ReplaceAction.Skip;

Don’t worry about the TagInfo class; it is a helper that keeps track of my child tags, the SDT title, what to remove from the beginning or end of the regex match, etc.
Here is where I have a problem: the FindAndSplitMatchRuns method. I found the example somewhere in your Aspose code examples, but that code does not work for me, because it assumes that the Runs you have found and want to delete are all inside one parent Paragraph. In my case that is not true.
How can I find all the Runs (and keep their formatting) and convert all the text inside the match I have found into an SDT, together with all of its child SDTs?
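
To make the run problem concrete: a match found in the document text can start in the middle of one Run and end in the middle of another, possibly in a different paragraph. Below is a minimal sketch of the offset bookkeeping only, with runs modelled as plain strings rather than Aspose.Words nodes. MatchMapper and Map are hypothetical names, and the sketch assumes the match offset is relative to the concatenation of exactly the run texts passed in (paragraph-break characters are ignored).

```csharp
using System;
using System.Collections.Generic;

public static class MatchMapper
{
    // Maps a match (start offset and length within the concatenated run texts)
    // to one slice per touched run: the run index, the start offset inside
    // that run, and the number of characters of the match that lie in it.
    public static List<(int RunIndex, int Start, int Length)> Map(
        IList<string> runTexts, int matchStart, int matchLength)
    {
        var slices = new List<(int RunIndex, int Start, int Length)>();
        int matchEnd = matchStart + matchLength;
        int runStart = 0; // offset of the current run in the concatenated text

        for (int i = 0; i < runTexts.Count; i++)
        {
            int runEnd = runStart + runTexts[i].Length;

            // Overlap of [matchStart, matchEnd) with [runStart, runEnd).
            int from = Math.Max(matchStart, runStart);
            int to = Math.Min(matchEnd, runEnd);
            if (from < to)
                slices.Add((i, from - runStart, to - from));

            runStart = runEnd;
        }
        return slices;
    }
}
```

Each returned slice says which run would need splitting and where. Because the run list is flat, the mapping does not depend on whether the runs share a parent Paragraph.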

Once again, apologies for the longer post,
Thanks in advance

tahir.manzoor · March 9, 2016, 4:52am

Hi Nikola,

Thanks for your inquiry and for sharing your requirements in detail. Please allow us some time to analyze your desired output. We will get back to you soon with a code example that meets your requirements.

tahir.manzoor · March 14, 2016, 3:07am

Hi Nikola,

Thanks for your patience. In your case, we suggest the following solution.

Implement the IReplacingCallback interface and insert a bookmark at the position of each custom tag, e.g. MAIN_TAG.
Create an empty document.
Insert StructuredDocumentTag nodes into the new empty document (the target document).
Extract the contents between the bookmarks and insert them into the StructuredDocumentTag.
Repeat the same process for all your custom tags except inline tags.
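
The steps above treat inline tags separately, so something has to decide at run time which kind a tag is. One possible rule, sketched below under stated assumptions: if the start and end markers are in different paragraphs, or the tagged span fills a whole paragraph, treat it as block level; otherwise treat it as inline. Paragraphs are modelled as plain strings, and SdtLevel, SdtLevelChooser and Choose are hypothetical names, not Aspose.Words API. The result would correspond to the MarkupLevel value passed to the StructuredDocumentTag constructor.

```csharp
using System;
using System.Collections.Generic;

public enum SdtLevel { Inline, Block }

public static class SdtLevelChooser
{
    // paragraphs: the text of each paragraph, in document order.
    // Finds the first paragraph containing startTag and the first paragraph
    // at or after it containing endTag, then applies the rule described above.
    public static SdtLevel Choose(IList<string> paragraphs, string startTag, string endTag)
    {
        int startPara = -1, endPara = -1;
        for (int i = 0; i < paragraphs.Count; i++)
        {
            if (startPara < 0 && paragraphs[i].Contains(startTag))
                startPara = i;
            if (startPara >= 0 && paragraphs[i].Contains(endTag))
            {
                endPara = i;
                break;
            }
        }
        if (startPara < 0 || endPara < 0)
            throw new ArgumentException("Start or end tag not found.");

        // Markers in different paragraphs: the content spans whole paragraphs.
        if (startPara != endPara)
            return SdtLevel.Block;

        // Same paragraph: block only if the tagged span is the entire paragraph.
        string text = paragraphs[startPara].Trim();
        bool fillsParagraph = text.StartsWith(startTag, StringComparison.Ordinal)
                           && text.EndsWith(endTag, StringComparison.Ordinal);
        return fillsParagraph ? SdtLevel.Block : SdtLevel.Inline;
    }
}
```

This rule is a simplification: a span that starts mid-paragraph and ends in a later paragraph is reported as block level here, although it would need to be split before it could be wrapped cleanly.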

Please read the following articles for reference.
Extract Content Overview and Code
Extract Content from a Bookmark
Working with Content Control (SDT)

Please get the code of the ExtractContent and GenerateDocument methods from the Aspose.Words for .NET examples repository on GitHub. We have attached the input/output documents to this post for your reference.

Please use the following code example to achieve your requirements. Hope this helps you.

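// Namespaces used below: System.Collections (ArrayList), System.Text.RegularExpressions (Regex),
// Aspose.Words, Aspose.Words.Markup (StructuredDocumentTag, SdtType, MarkupLevel).
Document doc = new Document(MyDir + "Input.docx");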
doc.Range.Replace(new Regex("<<MAIN_TAG>>"), new FindAndInsertBookmark("MAIN_TAG"), false);
doc.Range.Replace(new Regex("<<MAIN_TAG_END>>"), new FindAndInsertBookmark("MAIN_TAG_END"), false);
doc.Range.Replace(new Regex("<<INSIDE_TAG>>"), new FindAndInsertBookmark("INSIDE_TAG"), false);
doc.Range.Replace(new Regex("<<INSIDE_TAG_END>>"), new FindAndInsertBookmark("INSIDE_TAG_END"), false);
Document targetdoc = new Document();
DocumentBuilder builder = new DocumentBuilder(targetdoc);
// --- Extract contents of MAIN_TAG and insert them into target document. -----//
// Add MAIN_TAG to document
StructuredDocumentTag MAIN_TAG = new StructuredDocumentTag(targetdoc, SdtType.RichText, MarkupLevel.Block);
targetdoc.FirstSection.Body.AppendChild(MAIN_TAG);
MAIN_TAG.RemoveAllChildren();
MAIN_TAG.AppendChild(new Paragraph(targetdoc));
MAIN_TAG.Title = "MAIN_TAG";
ArrayList extractedNodes = ExtractContent(doc.Range.Bookmarks["MAIN_TAG"].BookmarkStart.ParentNode, doc.Range.Bookmarks["MAIN_TAG_END"].BookmarkStart.ParentNode, false);
Document docTemp = GenerateDocument(doc, extractedNodes);
builder.MoveTo(MAIN_TAG.FirstChild);
builder.InsertDocument(docTemp, ImportFormatMode.KeepSourceFormatting);
builder.MoveToDocumentEnd();
builder.Writeln();
// ----- Extract contents of INSIDE_TAG and insert them into target document. -----//
extractedNodes.Clear();
extractedNodes = ExtractContent(doc.Range.Bookmarks["INSIDE_TAG"].BookmarkStart.ParentNode, doc.Range.Bookmarks["INSIDE_TAG_END"].BookmarkStart.ParentNode, false);
docTemp = GenerateDocument(doc, extractedNodes);
Paragraph paragraph = (Paragraph)targetdoc.Range.Bookmarks["INSIDE_TAG"].BookmarkStart.ParentNode;
Paragraph paragraphend = (Paragraph)targetdoc.Range.Bookmarks["INSIDE_TAG_END"].BookmarkStart.ParentNode;
paragraph.Runs.Clear();
paragraphend.Runs.Clear();
// Add INSIDE_TAG to document
StructuredDocumentTag INSIDE_TAG = new StructuredDocumentTag(targetdoc, SdtType.RichText, MarkupLevel.Block);
INSIDE_TAG.RemoveAllChildren();
INSIDE_TAG.AppendChild(new Paragraph(targetdoc));
INSIDE_TAG.Title = "INSIDE_TAG";
// Paragraph paragraph = (Paragraph)targetdoc.Range.Bookmarks["INSIDE_TAG"].BookmarkStart.ParentNode;
MAIN_TAG.InsertBefore(INSIDE_TAG, paragraph);
extractedNodes.Clear();
extractedNodes = ExtractContent(doc.Range.Bookmarks["INSIDE_TAG"].BookmarkStart.ParentNode, doc.Range.Bookmarks["INSIDE_TAG_END"].BookmarkStart.ParentNode, false);
docTemp = GenerateDocument(doc, extractedNodes);
builder.MoveTo(INSIDE_TAG.FirstChild);
builder.InsertDocument(docTemp, ImportFormatMode.KeepSourceFormatting);
// Remove contents between INSIDE_TAG and INSIDE_TAG_END
builder.MoveTo(paragraph);
builder.StartBookmark("Temp");
builder.MoveTo(paragraphend);
builder.EndBookmark("Temp");
targetdoc.Range.Bookmarks["Temp"].Text = "";
INSIDE_TAG.LastChild.Remove();
MAIN_TAG.LastChild.Remove();
targetdoc.Range.Bookmarks["Temp"].BookmarkStart.ParentNode.Remove();
targetdoc.Range.Replace("<<INLINE_TAG", "", false, false);
targetdoc.Range.Replace(">>", "", false, false);
targetdoc.FirstSection.Body.FirstParagraph.Remove();
targetdoc.Save(MyDir + "Out.docx");

public class FindAndInsertBookmark : IReplacingCallback
{
    String bookmarkname;
    DocumentBuilder builder;
    public FindAndInsertBookmark(String name)
    {
        bookmarkname = name;
    }
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        Node currentNode = e.MatchNode;
        // Create the DocumentBuilder once and reuse it
        if (builder == null)
            builder = new DocumentBuilder(e.MatchNode.Document as Document);
        builder.MoveTo(currentNode);
        // Insert bookmarks at the place of Tags
        builder.StartBookmark(bookmarkname);
        builder.EndBookmark(bookmarkname);
        return ReplaceAction.Skip;
    }
}