Delete text between two tags using Aspose.Words for Java

saranyasrinivasan92 · June 2, 2020, 3:56pm

Step 1: Input Document
Inputfile.zip (19.3 KB)

Consider the following tags:
start tag: $$$abcstart{
end tag: }abcend$$$

Step 2: Find the start and end tags and delete all content from the start tag to the end tag

Expected output: Outputfile.zip (18.4 KB)

tahir.manzoor · June 2, 2020, 5:44pm

In your case, we suggest that you bookmark the content that you want to delete. You can use the following steps to achieve your requirement.

1. Please read the article Find and Replace.
2. Implement the IReplacingCallback interface and use the Range.Replace method to find the start tag.
3. In IReplacingCallback.Replacing, move the cursor to the matched node.
4. Insert the BookmarkStart node.
5. Do the same for the end tag and insert the BookmarkEnd node.
6. Use the Bookmark.Text property to set the bookmark’s text to an empty string.
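
As a rough illustration of what these steps achieve, here is a minimal plain-string sketch in Java. It is an analogue only: it uses no Aspose.Words API, and the names TagRangeSketch and deleteFirstTaggedRegion are hypothetical. In a real document a tag can be spread over several runs, which is why the bookmark approach is suggested instead of simple string handling.

```java
// Plain-string analogue of the bookmark approach; no Aspose.Words API is used here.
final class TagRangeSketch {

    /** Removes the first region from startTag through endTag, tags included. */
    static String deleteFirstTaggedRegion(String text, String startTag, String endTag) {
        int start = text.indexOf(startTag);
        if (start < 0) {
            return text; // no start tag: nothing to delete
        }
        // Look for the end tag only after the start tag.
        int end = text.indexOf(endTag, start + startTag.length());
        if (end < 0) {
            return text; // unmatched start tag: leave the text unchanged
        }
        return text.substring(0, start) + text.substring(end + endTag.length());
    }

    public static void main(String[] args) {
        String input = "Keep $$$abcstart{ remove me }abcend$$$ keep";
        System.out.println(deleteFirstTaggedRegion(input, "$$$abcstart{", "}abcend$$$"));
    }
}
```

The start position plays the role of BookmarkStart and the position just past the end tag plays the role of BookmarkEnd; cutting out that span corresponds to setting Bookmark.Text to an empty string.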

Hope this helps you.

saranyasrinivasan92 · June 3, 2020, 3:15pm

Source.zip (82.5 KB)

Please check the attachment for the sample code, input, current output, and expected output.

I tried adding a bookmark, but it is not working. Can you please help with sample code?

tahir.manzoor · June 3, 2020, 7:11pm

@saranyasrinivasan92

We are working on your query and will get back to you soon.

tahir.manzoor · June 4, 2020, 10:00am

@saranyasrinivasan92

The following code example shows how to bookmark the desired content and remove it. Hope this helps you.

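// Namespaces the snippet relies on (ArrayList lives in System.Collections).
using System;
using System.Collections;
using Aspose.Words;
using Aspose.Words.Replacing;

string StartTag = @"$abcstart{";
string EndTag = @"}abcend$";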
Document doc = new Document(MyDir + "RegaxInputFile.docx");//Size 22k
FindReplaceOptions options = new FindReplaceOptions();
options.ReplacingCallback = new FindAndInsertBookmark("bookmark", true);
options.Direction = FindReplaceDirection.Backward;
options.MatchCase = false;
doc.Range.Replace(StartTag, "", options);

options.ReplacingCallback = new FindAndInsertBookmark("bookmark", false);
doc.Range.Replace(EndTag, "", options);

doc.UpdatePageLayout();
Bookmark bookmark = doc.Range.Bookmarks["bookmark"];
bookmark.Text = "";
doc.Save(MyDir + "20.6.docx");

public class FindAndInsertBookmark : IReplacingCallback
{
    string bmname;
    Boolean isStart;
    DocumentBuilder builder;
    public FindAndInsertBookmark(string bmname, Boolean isStart)
    {
        this.bmname = bmname;
        this.isStart = isStart;
    }
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        if (builder == null)
            builder = new DocumentBuilder((Document)currentNode.Document);

        // The first (and maybe the only) run can contain text before the match;
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node. 
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

        if (isStart)
        {
            Run run = (Run)runs[0];
            run.ParentNode.InsertBefore(new BookmarkStart(run.Document, bmname), run);
        }
        else
        {
            Run run = (Run)runs[0];
            run.ParentNode.InsertAfter(new BookmarkEnd(run.Document, bmname), run);
        }


        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.Skip;
    }

    /// <summary>
    /// Splits text of the specified run into two runs.
    /// Inserts the new run just after the specified run.
    /// </summary>
    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}

saranyasrinivasan92 · June 4, 2020, 11:18am

Thanks a lot, it is working for one start and end tag, but it does not support duplicates. Is that possible?

The input document contains text like $abcstart{ some text table content }abcend$
and has duplicates of the same tag:
$abcstart{ some text table content }abcend$
$abcstart{ some text table content }abcend$
$abcstart{ some text table content }abcend$
I am trying to delete each set of content that uses the same start and end tags. Please suggest a code sample.

The current code's output also does not match the expected output.
It should delete the whole of $abcstart{ some text table content }abcend$
but the deletion is incomplete: the text up to the $ is still not deleted.
Updated code: Source.zip (82.6 KB)
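
For reference, on plain text the duplicate case amounts to removing every non-overlapping start/end pair. The Java sketch below shows that idea with the standard regex classes only. It is not Aspose.Words code, and RepeatedTagSketch and deleteAllTaggedRegions are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Plain-text illustration of deleting every tagged region; no Aspose.Words API is used.
final class RepeatedTagSketch {

    /** Removes every region from startTag through the nearest following endTag. */
    static String deleteAllTaggedRegions(String text, String startTag, String endTag) {
        // Pattern.quote treats '$', '{' and '}' in the tags as literal characters.
        // ".*?" is reluctant, so each start tag pairs with the closest end tag.
        // DOTALL lets a region span line breaks.
        Pattern region = Pattern.compile(
                Pattern.quote(startTag) + ".*?" + Pattern.quote(endTag),
                Pattern.DOTALL);
        return region.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "A$abcstart{ one }abcend$B$abcstart{ two }abcend$C";
        System.out.println(deleteAllTaggedRegions(input, "$abcstart{", "}abcend$"));
    }
}
```

The reluctant quantifier matters here: a greedy ".*" would pair the first start tag with the last end tag and delete the text between separate regions as well.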

tahir.manzoor · June 4, 2020, 6:17pm

@saranyasrinivasan92

We have modified the code according to your new requirement. We have attached the output document for your kind reference.
20.6.zip (24.9 KB)

string StartTag = @"$abcstart{";
string EndTag = @"}abcend$";
Document doc = new Document(MyDir + "RegaxInputFile.docx"); 
FindReplaceOptions options = new FindReplaceOptions();
options.ReplacingCallback = new FindAndInsertBookmark("bookmark", true);
options.Direction = FindReplaceDirection.Backward;
options.MatchCase = false;
doc.Range.Replace(StartTag, "", options);

options.ReplacingCallback = new FindAndInsertBookmark("bookmark", false);
doc.Range.Replace(EndTag, "", options);

doc.UpdatePageLayout();
foreach (Bookmark bookmark in doc.Range.Bookmarks)
    bookmark.Text = "";

doc.Save(MyDir + "20.6.docx");

public class FindAndInsertBookmark : IReplacingCallback
{
    string bmname;
    int i = 1;
    Boolean isStart;
    DocumentBuilder builder;
    public FindAndInsertBookmark(string bmname, Boolean isStart)
    {
        this.bmname = bmname;
        this.isStart = isStart;
    }
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        if (builder == null)
            builder = new DocumentBuilder((Document)currentNode.Document);

        // The first (and maybe the only) run can contain text before the match;
        // in this case it is necessary to split the run.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node. 
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

        if (isStart)
        {
            Run run = (Run)runs[0];
            run.ParentNode.InsertBefore(new BookmarkStart(run.Document, bmname+i), run);
            i++;
        }
        else
        {
            Run run = (Run)runs[0];
            run.ParentNode.InsertAfter(new BookmarkEnd(run.Document, bmname + i), run);
            i++;
        }

        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.Skip;
    }

    /// <summary>
    /// Splits text of the specified run into two runs.
    /// Inserts the new run just after the specified run.
    /// </summary>
    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}

saranyasrinivasan92 · June 5, 2020, 7:33am

The code is not working in all cases.

Sorry for the confusion. I have now mentioned the same tags that I am using. Please check the attachment.
Code sample: Source.zip (70.6 KB)

I am trying the following case scenarios:

Case 1. Delete tag string[0], which has 4 duplicates in the input file. Currently only 2 of the 4 duplicates are deleted. To replicate, use only string[0].

Case 2. Delete the data of two different tags: string[0] with duplicates and string[1] without duplicates. This is not working; at the end, some text is still not deleted. To replicate, use string[0] and string[1].

tahir.manzoor · June 5, 2020, 12:50pm

@saranyasrinivasan92

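It seems that you are using an old version of Aspose.Words. We have tested the scenario using the latest version of Aspose.Words for .NET 20.6 and have not found the shared issue. Please check the attached output document: 20.6 (2).zip (22.4 KB)

Please replace the if/else block in IReplacingCallback.Replacing with the following modified snippet. For the end tag, the BookmarkEnd node is now inserted after the last run of the match (runs[runs.Count - 1]) rather than the first, so the whole end tag falls inside the bookmark.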

if (isStart)
{
    Run run = (Run)runs[0];
    run.ParentNode.InsertBefore(new BookmarkStart(run.Document, bmname+i), run);
    i++;
}
else
{
    Run run = (Run)runs[runs.Count - 1];
    run.ParentNode.InsertAfter(new BookmarkEnd(run.Document, bmname + i), run);
    i++;
}

In your use cases, you need to bookmark the content and delete it using the Bookmark.Text property. So, please make sure that you are inserting the BookmarkStart and BookmarkEnd nodes correctly. Please use the same approach shared in the above code examples to delete the content.
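
Why the end marker belongs after the last run of the end-tag match can be sketched with plain strings in Java. The run layout below is an assumption made for illustration (the end tag split into "}abcend" and "$"), and EndMarkerSketch and deleteRuns are hypothetical names; no Aspose.Words API is involved.

```java
import java.util.Arrays;
import java.util.List;

// Models a paragraph as a list of run texts; no Aspose.Words API is used.
final class EndMarkerSketch {

    /** Joins the runs that remain after deleting runs[from..to], both inclusive. */
    static String deleteRuns(List<String> runs, int from, int to) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < runs.size(); i++) {
            if (i < from || i > to) {
                kept.append(runs.get(i));
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Assumed layout: the end tag "}abcend$" is split across two runs (index 3 and 4).
        List<String> runs = Arrays.asList(
                "before ", "$abcstart{", " content ", "}abcend", "$", " after");

        // End marker after the FIRST run of the end tag: the "$" run survives.
        System.out.println(deleteRuns(runs, 1, 3)); // prints "before $ after"

        // End marker after the LAST run of the end tag: the whole tag is removed.
        System.out.println(deleteRuns(runs, 1, 4)); // prints "before  after"
    }
}
```

The first call mirrors using runs[0] for the end tag, and the second mirrors runs[runs.Count - 1] from the modified snippet above.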

saranyasrinivasan92 · June 5, 2020, 1:47pm

I updated to the above code and to version 20.6.
Case 1: multiple tags are deleted perfectly.

Case 2: not working. Please check the attachment: Source.zip (72.3 KB)

The attachment that you shared also still has tag string[1].
I am trying to delete string[0], [1], and [3] in Case 2.

tahir.manzoor · June 5, 2020, 7:40pm

@saranyasrinivasan92

It is nice to hear from you that the code works for this case.

You are iterating over the tags and saving the document inside the for loop. Please save the document outside the loop.

For the second and third tags, the bookmarks are not added to the document because each new FindAndInsertBookmark instance sets the variable i back to 1, so the bookmark names repeat and the existing bookmarks are replaced.

To achieve your requirement, we suggest the following solution.

1. Get the count of bookmarks whose names start with “bookmark”.
2. Set the value of FindAndInsertBookmark.i to count + 1 for the second and third tags.
3. Perform the find and replace operation for the second and third tags.

In your code, you may pass the value of i to the FindAndInsertBookmark constructor.
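
The counting step can be sketched in plain Java as follows. This is only an illustration of the arithmetic: the bookmark names are passed in as a list of strings rather than read from a document, and BookmarkIndexSketch and nextIndex are hypothetical names, not part of Aspose.Words.

```java
import java.util.Arrays;
import java.util.List;

// Computes the starting value of i for the next tag; no Aspose.Words API is used.
final class BookmarkIndexSketch {

    /** Returns count + 1, where count is the number of names starting with prefix. */
    static int nextIndex(List<String> existingNames, String prefix) {
        int count = 0;
        for (String name : existingNames) {
            if (name.startsWith(prefix)) {
                count++;
            }
        }
        return count + 1;
    }

    public static void main(String[] args) {
        // Names assumed to be present after the first tag has been processed.
        List<String> names = Arrays.asList("bookmark1", "bookmark2", "other");
        System.out.println(nextIndex(names, "bookmark")); // prints 3
    }
}
```

The returned value is what would be passed to the FindAndInsertBookmark constructor for the next tag, so new bookmark names continue after the existing ones instead of starting again at 1.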

Hope this helps you.