Extracting Nodes between two Runs within a Table

Etaardvark · August 13, 2018, 9:00am

I’m hitting an issue extracting the content between two Runs in this document. The content between the two Runs is an iteration block, so it needs to be duplicated a number of times.

Strangely, the same code works except when both Runs are inside the same table but on different rows.

The two Runs I’m looking to extract and copy between in this document are “<<&foreach” as the starting Run and “<<&endfor>>” as the ending Run.

I’m using the Common.ExtractContents method from the GitHub samples, but if I give it those two nodes it always returns zero nodes. It looks as if both the start and end node are evaluated as the enclosing table.

I have tried processing the nodes manually by following NextSibling from the start node and recursing into any node that has children, but that only seems to give me the three Run nodes, not any of the cell or row objects, and no containing paragraphs.

If I move the <<&EndFor>> tag so it’s outside the table, the contents extract correctly.

I’ve attached the document that is causing the problems.

If there is an easy way to select the content between the two Runs and copy it x times that I am missing in the documentation, a pointer would be appreciated. At the moment I’m having to process everything manually, hence the need to extract the contents between the nodes.

29843.zip (6.4 KB)

Peter

tahir.manzoor · August 13, 2018, 5:40pm

@Etaardvark,

Thanks for your inquiry. The Common.ExtractContents method may not work properly in your case. We suggest the following solution:

Get the parent rows of start and end tags e.g. row1 and row2.
Clone these two rows and the rows between them.
Insert the cloned rows into the table.
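
A minimal sketch of these three steps, assuming the start and end tags have already been isolated as Run nodes inside the same table. The class and method names (RowRangeCloner, CloneRowsBetween) are hypothetical and chosen only for illustration; inserting the copies directly after the end row is also an assumption about where they should go.

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Tables;

public static class RowRangeCloner
{
    // Hypothetical helper: clones every row from the start tag's row through
    // the end tag's row and inserts the copies directly after the end row.
    // Returns the number of rows cloned (0 if the tags are not in one table).
    public static int CloneRowsBetween(Run startRun, Run endRun)
    {
        // Step 1: get the parent rows of the start and end tags.
        Row row1 = (Row)startRun.GetAncestor(NodeType.Row);
        Row row2 = (Row)endRun.GetAncestor(NodeType.Row);
        if (row1 == null || row2 == null || row1.ParentTable != row2.ParentTable)
            return 0;

        Table table = row1.ParentTable;
        int first = table.Rows.IndexOf(row1);
        int last = table.Rows.IndexOf(row2);
        if (first > last)
            return 0;

        // Step 2: clone these two rows and the rows between them.
        // Clone before inserting so the row indexes stay stable.
        List<Row> clones = new List<Row>();
        for (int i = first; i <= last; i++)
            clones.Add((Row)table.Rows[i].Clone(true));

        // Step 3: insert the cloned rows into the table, preserving their order.
        Node insertAfter = row2;
        foreach (Row clone in clones)
        {
            table.InsertAfter(clone, insertAfter);
            insertAfter = clone;
        }
        return clones.Count;
    }
}
```

Note that this copies whole rows, so any text in those rows that sits before the start tag or after the end tag is copied as well.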

Hope this helps you.

Etaardvark · August 14, 2018, 9:30am

Not quite. The problem is that the Row is outside the range as far as our user expects.

With the supplied document, I think the expectation is to get a <&Foreach run><&endfor run>

The only way I can see to do this is to perform the clone manually (so that the data before or after the edge nodes is removed), but it seems like I have to navigate outside the range to be able to traverse it.

e.g. if I get the ParentNode of the Run for the starting foreach tag, I get a Paragraph object. But if I ask for the next sibling of the paragraph it returns null, so I have to navigate out to the containing cell to be able to traverse the document.

The situation gets worse if there are multiple paragraphs within the cell, as the starting node may be the second paragraph within the cell.

With the old Word object model this is easy to do using a Range object: just set the range to the start and end tags, then insert the formatted text for the range. This worked for tables / cells / rows regardless of where the starting / finishing tags were located:

    If IterationRepeatCount = 0 Then
      ' No iterations required so remove fields in the foreach section
      DocRange.Delete
    Else
      Set InsertRange = DocRange.Duplicate
      InsertRange.Start = DocRange.End
   
      ' Copy for the number of iterations
      For IterationRepeatPtr = 1 To (IterationRepeatCount - 1)

        InsertRange.FormattedText = DocRange.FormattedText
        InsertRange.End = DocRange.End
      Next
    
      Set InsertRange = Nothing
    End If

Is there nothing in Aspose that can do a similar replacement based on a range? The Range object is a facade which only allows a replace, with no ‘clone’ functionality, so it isn’t a direct replacement.

The problem with manually cloning the nodes is working out where to start (i.e. from the Run node, do you have to extend out to the Paragraph -> Cell -> Row?) in a way that works in all cases, and then excluding content that isn’t within the matching range.

This really does feel like something the API should be able to deal with. It doesn’t even look like I can do it with bookmarks, cloning the content into a new sub-document, as I hit exactly the same problem extracting the nodes between the bookmark start and end. This does appear to be something that has been asked a number of times before.

tahir.manzoor · August 14, 2018, 6:21pm

@Etaardvark

You can use Node.NextPreOrder method to get the next node according to the pre-order tree traversal algorithm.
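
As a rough sketch of that traversal, the helper below walks from a start node to an end node with Node.NextPreOrder and collects everything it visits in between. The names PreOrderWalker and NodesBetween are hypothetical, and using the owning document as the traversal root is an assumption. Because a pre-order walk climbs out of the current paragraph and cell when they have no further siblings, the result includes the Row, Cell and Paragraph nodes as well as the Runs.

```csharp
using System.Collections.Generic;
using Aspose.Words;

public static class PreOrderWalker
{
    // Hypothetical helper: returns the nodes visited in document order
    // strictly between start and end (both excluded).
    // If end is never reached, the walk continues to the end of the document.
    public static List<Node> NodesBetween(Node start, Node end)
    {
        List<Node> result = new List<Node>();

        // Assumption: the whole document is the traversal root, so the walk
        // can leave the paragraph, cell and row that contain the start node.
        Node root = start.Document;

        Node current = start.NextPreOrder(root);
        while (current != null && current != end)
        {
            result.Add(current);
            current = current.NextPreOrder(root);
        }
        return result;
    }
}
```

The list only identifies which nodes lie between the two tags; a composite node such as a Row appears in it even if part of its content lies outside the range, so deciding what to clone is still up to the caller.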

Please note that Aspose.Words’ model is quite different from the Microsoft Word’s Object Model in that it represents the document as a tree of objects more like an XML DOM tree. If you worked with any XML DOM library you will find it is easy to understand and work with Aspose.Words. When you load a Word document into Aspose.Words, it builds its DOM and all document elements and formatting are simply loaded into memory. Please read the following article for more information on DOM:
Aspose.Words Document Object Model

Could you please ZIP and attach your input and expected output Word documents here for our reference? We will then provide you more information about your query along with the code.

Etaardvark · August 15, 2018, 7:02am

AsposeSamples.zip (12.2 KB)

I’ve attached a zip with sample input and output documents, produced using our old document generation method. The <<&foreach and <<&endfor>> range has been expanded 5 times, and then the <> tags replaced with data. Also note that you get the blank row at the end (the <<&endfor>> tag is removed, but that row remained in place).

I will readily admit this is an edge case with the tags in the same table, but as I’m trying to replace the existing link using automation, I really need to be able to perform the same function.

We have the issue that the ‘template’ documents are generated by our clients, so we don’t have complete control over the layout or the format of the document (i.e. doc or docx). When this document is converted to WordML, it’s actually reasonably easy to copy the underlying XML, but that’s probably not a robust solution for all cases.

So the root problem is being able to duplicate, with the relevant formatting, the section between the two tags. I’ve got code that splits the Run objects so I can ensure that the nodes I am working between only contain the relevant tags, but the superposition of the tags (i.e. their context within a table, for example) seems to be what is causing problems.

Peter

Etaardvark · August 15, 2018, 9:31am

I’ve isolated the problem a little more. It only seems to go wrong when there are rows / cells after the section being copied.

i.e. if you have a Table with a header row and the <<&foreach <<&endfor>> block in the second row, it functions correctly and the row is repeated (i.e. ExtractContent returns content).

If you have the <<&foreach <<&endfor>> block on the first row of the Table with a subsequent row after it, the process fails.

I have written code to do a ‘specific’ clone of the parent table (to exclude anything not within the two tags, by performing the Clone on the parent and then removing any child Runs before / after the start and end nodes). This extracts the Row / Cell / Paragraphs correctly, containing only the Runs between the start / end tags.

However, when you try to add them into the table the formatting goes wrong (as you are inserting rows / cells with data, which moves existing cells).

Is a possible solution ‘recreating’ the table from scratch? That is, create the table containing the tags in a separate document (performing the iteration expansion as the table is created), then copy the table from the new document back, replacing the one in the original document.

tahir.manzoor · August 15, 2018, 1:47pm

@Etaardvark,

Thanks for sharing the detail. Please use the following code example to repeat the foreach and endfor tags. In the output document, each cell has one Run node that keeps the font formatting.

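using System.Collections;        // ArrayList (used by the callback class below)
using Aspose.Words;
using Aspose.Words.Replacing;    // FindReplaceOptions, IReplacingCallback
using Aspose.Words.Tables;       // Row, Cell

// MyDir is the folder that contains the input document.
Document doc = new Document(MyDir + @"inputdoc.doc");

// Replace the start tag with an empty bookmark named "bm_start".
FindTagAndInsertBookmark obj = new FindTagAndInsertBookmark("bm_start");
FindReplaceOptions options = new FindReplaceOptions();
options.ReplacingCallback = obj;
doc.Range.Replace("<<&foreach", "", options);

// Replace the end tag with an empty bookmark named "bm_end".
obj = new FindTagAndInsertBookmark("bm_end");
options = new FindReplaceOptions();
options.ReplacingCallback = obj;
doc.Range.Replace("<<&endfor>>", "", options);

Bookmark start = doc.Range.Bookmarks["bm_start"];
Bookmark end = doc.Range.Bookmarks["bm_end"];

// Only proceed when both tags were found and both sit inside table rows.
if (start != null && end != null
    && start.BookmarkStart.GetAncestor(NodeType.Row) != null
    && end.BookmarkStart.GetAncestor(NodeType.Row) != null)
{
    Row row = (Row)start.BookmarkStart.GetAncestor(NodeType.Row);
    int rowindex = row.ParentTable.Rows.IndexOf(row);

    // Repeat the row 10 times (the iteration count is hard-coded in this example).
    for (int i = 0; i < 10; i++)
    {
        Row cloneRow = (Row)row.Clone(true);

        foreach (Cell cell in cloneRow)
        {
            if (cell.FirstParagraph.Runs.Count > 0)
            {
                // Keep the first Run (and its font formatting), replace its text,
                // and make it the only content of the cell.
                Run run = (Run)cell.FirstParagraph.Runs[0];
                run.Text = "new text";
                cell.RemoveAllChildren();
                cell.EnsureMinimum();
                cell.FirstParagraph.Runs.Add(run);
            }
        }
        row.ParentTable.Rows.Insert(rowindex + 1, cloneRow);
    }
}

doc.Save(MyDir + "18.8.docx");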

public class FindTagAndInsertBookmark : IReplacingCallback
{
    private string bookmark;
    public FindTagAndInsertBookmark(string bmname)
    {
        bookmark = bmname;
    }
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;
                
        // The first (and may be the only) run can contain text before the match, 
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node. 
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }


        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);

        // Move the cursor to the matched text and insert an empty bookmark there.
        builder.MoveTo((Run)runs[0]);
        builder.StartBookmark(bookmark);
        builder.EndBookmark(bookmark);

        // Remove the Run nodes that contained the matched tag.
        foreach (Run run in runs)
            run.Remove();

        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.Skip;
    }

    /// <summary>
    /// Splits text of the specified run into two runs.
    /// Inserts the new run just after the specified run.
    /// </summary>
    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }

}

Yes, you can achieve this requirement using Aspose.Words. You can use NodeImporter.ImportNode method to import a node from one document into another.
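
A minimal sketch of copying a rebuilt table back into the original document with NodeImporter, assuming the replacement table was built in a separate document. The names TableReplacer and ReplaceTable are hypothetical, and ImportFormatMode.KeepSourceFormatting is an assumption about which formatting should win.

```csharp
using Aspose.Words;
using Aspose.Words.Tables;

public static class TableReplacer
{
    // Hypothetical helper: imports 'replacement' (which belongs to another
    // document) into the document that owns 'original', places it where
    // 'original' was, and returns the imported table.
    public static Table ReplaceTable(Table original, Table replacement)
    {
        DocumentBase srcDoc = replacement.Document;
        DocumentBase dstDoc = original.Document;

        // A node cannot be inserted into a different document directly,
        // so it has to be imported first.
        NodeImporter importer =
            new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
        Table imported = (Table)importer.ImportNode(replacement, true);

        // Insert the imported copy next to the original, then drop the original.
        original.ParentNode.InsertAfter(imported, original);
        original.Remove();
        return imported;
    }
}
```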