Splitting document based on start and end text

Hi Team,

We are evaluating the Aspose.Words for Java API in a POC to check whether it meets our requirements. We need to chunk a Word document: extract specific content between a start and an end text phrase and create a new document without losing the formatting of the original. Is this possible with the Aspose API? I have tried some samples based on the post ‘Aspose Word: Search a search word and extract content between the search word- split content between it into new document’, but that approach splits a document on a repeating text. Could you please share a sample that splits the document based on start and end text? The start text may be the heading of a paragraph.

Thanks.
Renjith.

Also, in the sample I built from that example, bookmarking does not work if the start text is a paragraph heading. Thanks.

Hi Renjith,

Thanks for your inquiry. Please refer to the following article:
Extract Selected Content Between Nodes

In your case, we suggest the following solution:

  1. Find the start text and insert a bookmark e.g. bmstart.
  2. Find the end text and insert a bookmark e.g. bmend.
  3. Extract the contents between bookmarks bmstart and bmend
  4. Generate the document from extracted nodes.
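To make the four steps concrete, here is a minimal plain-Java sketch of the same idea on an ordinary string rather than on a document: locate the start phrase, locate the end phrase after it, and cut out what lies between. It does not use Aspose.Words at all. The class name BetweenMarkers and the method extract are made up for this illustration, and a real document still needs the bookmark approach so that formatting is preserved.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BetweenMarkers {
    /**
     * Returns the text from the first start phrase up to the first end phrase
     * that follows it. When includeEnd is true the end phrase itself is kept.
     */
    static Optional<String> extract(String text, String startPhrase,
                                    String endPhrase, boolean includeEnd) {
        // Pattern.quote treats the phrases as literal text, not as regex syntax.
        Matcher start = Pattern.compile(Pattern.quote(startPhrase),
                Pattern.CASE_INSENSITIVE).matcher(text);
        if (!start.find()) {
            return Optional.empty();
        }
        Matcher end = Pattern.compile(Pattern.quote(endPhrase),
                Pattern.CASE_INSENSITIVE).matcher(text);
        // Only look for the end phrase after the start phrase.
        if (!end.find(start.end())) {
            return Optional.empty();
        }
        int stop = includeEnd ? end.end() : end.start();
        return Optional.of(text.substring(start.start(), stop));
    }

    public static void main(String[] args) {
        String text = "Intro. Start text: alpha beta. End text here. Tail.";
        System.out.println(extract(text, "start text", "end text", true).orElse("(not found)"));
    }
}
```

The includeEnd flag corresponds to the choice of where the end marker sits: before the end phrase the phrase is left out, after it the phrase is included.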

Please check the following code example. Hope this helps you.

// Imports needed by this snippet.
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

// Load the document. MyDir is the folder that holds your documents.
Document doc = new Document(MyDir + "in.docx");

DocumentBuilder builder = new DocumentBuilder(doc);

// Find the start text and insert a bookmark there.
Pattern regex = Pattern.compile("start text", Pattern.CASE_INSENSITIVE);

FindReplaceOptions options = new FindReplaceOptions();
options.setReplacingCallback(new FindAndInsertBookmark("bmstart"));

doc.getRange().replace(regex, "", options);

// Find the end text and insert a bookmark there.
regex = Pattern.compile("end text", Pattern.CASE_INSENSITIVE);

options = new FindReplaceOptions();
options.setReplacingCallback(new FindAndInsertBookmark("bmend"));

doc.getRange().replace(regex, "", options);

// Extract the content between the two bookmarks.
// extractContent and generateDocument are the helper methods described in the article linked above.
BookmarkStart bStart = doc.getRange().getBookmarks().get("bmstart").getBookmarkStart();
BookmarkEnd bEnd = doc.getRange().getBookmarks().get("bmend").getBookmarkEnd();

ArrayList nodes = extractContent(bStart, bEnd, true);
Document newdoc = generateDocument(doc, nodes);

newdoc.save(MyDir + "Out.docx");
class FindAndInsertBookmark implements IReplacingCallback
{
    String bookmark;
    FindAndInsertBookmark(String bm)
    {
        bookmark = bm;
    }

    public int replacing(ReplacingArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();
        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run)currentNode, e.getMatchOffset());
        // This list collects all runs of the match so the bookmark can be placed on them.
        ArrayList runs = new ArrayList();
        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while (
                (remainingLength > 0) &&
                        (currentNode != null) &&
                        (currentNode.getText().length() <= remainingLength))
        {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();
            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }
        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            splitRun((Run)currentNode, remainingLength);
            runs.add(currentNode);
        }
        DocumentBuilder builder = new DocumentBuilder((Document)((Run)runs.get(0)).getDocument());
        builder.moveTo((Run)runs.get(0));
        builder.startBookmark(bookmark);
        builder.endBookmark(bookmark);
        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.SKIP;
    }

    /**
      Splits text of the specified run into two runs.
      Inserts the new run just after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run)run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
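The run handling in the callback can be hard to follow, so here is a small plain-Java model of it. It treats a paragraph as a list of strings (standing in for Run nodes) and cuts the runs at the match boundaries so that the match ends up covered by whole runs. This is only a sketch of the idea, not Aspose.Words code; RunSplitDemo and splitAtBoundaries are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunSplitDemo {
    /**
     * Cuts the runs at matchStart and matchStart + matchLength (offsets into the
     * concatenated text) so that the match is made up of complete runs.
     * Empty runs are dropped.
     */
    static List<String> splitAtBoundaries(List<String> runs, int matchStart, int matchLength) {
        int matchEnd = matchStart + matchLength;
        List<String> result = new ArrayList<>();
        int offset = 0; // position of the current run in the concatenated text
        for (String run : runs) {
            int runEnd = offset + run.length();
            int cursor = offset;
            for (int cut : new int[] {matchStart, matchEnd}) {
                // Split only when the boundary falls strictly inside this run.
                if (cut > cursor && cut < runEnd) {
                    result.add(run.substring(cursor - offset, cut - offset));
                    cursor = cut;
                }
            }
            if (cursor < runEnd) {
                result.add(run.substring(cursor - offset));
            }
            offset = runEnd;
        }
        return result;
    }

    public static void main(String[] args) {
        // "start text" begins at offset 4 and is spread over three runs.
        List<String> runs = Arrays.asList("The sta", "rt te", "xt follows");
        System.out.println(splitAtBoundaries(runs, 4, 10));
    }
}
```

After the split, a marker placed before the first run of the match or after its last run lands exactly on the match boundary, which is what the callback relies on.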

Thank you very much, Tahir. The code works for me. However, it does not include the end text in the new document. What should I do about that?

Hi Renjith,

Thanks for your inquiry. In this case, you need to insert the second bookmark after the “end text”. You can use the BookmarkStart and BookmarkEnd classes to create a bookmark and the CompositeNode.insertAfter method to insert it after the “end text” Run node.

Please let us know if you have any more queries.

Hi Tahir,

Are you referring to the method below? It already uses the insertAfter method. Please correct me if I am wrong.

/**
*
* Splits text of the specified run into two runs.
*
* Inserts the new run just after the specified run.
*
*/
private Run splitRun(Run run, int position) throws Exception
{
    Run afterRun = (Run) run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring((0), (0) + (position)));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}

Thanks.

Hi Renjith,
Thanks for your inquiry. Please try the following modified code snippet. Hope this helps you.

public int replacing(ReplacingArgs e) throws Exception {
    // This is a Run node that contains either the beginning or the complete match.
    Node currentNode = e.getMatchNode();

    // The first (and may be the only) run can contain text before the match,
    // in this case it is necessary to split the run.
    if (e.getMatchOffset() > 0)
        currentNode = splitRun((Run) currentNode, e.getMatchOffset());

    ArrayList runs = new ArrayList();

    // Find all runs that contain parts of the match string.
    int remainingLength = e.getMatch().group().length();
    while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength)) {
        runs.add(currentNode);
        remainingLength = remainingLength - currentNode.getText().length();

        // Select the next Run node.
        // Have to loop because there could be other nodes such as BookmarkStart etc.
        do {
            currentNode = currentNode.getNextSibling();
        }
        while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
    }

    // Split the last run that contains the match if there is any text left.
    if ((currentNode != null) && (remainingLength > 0)) {
        splitRun((Run) currentNode, remainingLength);
        runs.add(currentNode);
    }

    Document doc = (Document) e.getMatchNode().getDocument();
    DocumentBuilder builder = new DocumentBuilder(doc);

    // Compare string contents with equals; == only compares object references in Java.
    if ("bmend".equals(bookmark)) {
        BookmarkStart bs = new BookmarkStart(doc, bookmark);
        BookmarkEnd be = new BookmarkEnd(doc, bookmark);
        Run run = (Run) runs.get(runs.size() - 1);

        run.getParentParagraph().insertAfter(bs, run);

        bs.getParentNode().insertAfter(be, bs);
    } else {

        builder.moveTo((Run) runs.get(0));
        builder.startBookmark(bookmark);
        builder.endBookmark(bookmark);
    }

    // Signal to the replace engine to do nothing because we have already done everything we wanted.
    return ReplaceAction.SKIP;
}

Thank you for your quick help. It’s working fine. I’ll let you know if I need any more help.

Hi Renjith,

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Hi Tahir,

I need an addition to the above logic. Suppose the start and end texts each occur several times in the document. How do we insert a bookmark at, say, the 2nd or 3rd occurrence of the text, given the occurrence number?

Thanks in advance!

Hi Renjith,

Thanks for your inquiry. In this case, we suggest the following solution:

  1. Please add an integer member in FindAndInsertBookmark class
  2. Increment its value with 1 in IReplacingCallback.replacing method
  3. Add bookmarks in document with name bmstart1, bmstart2 and so on. Do the same for bmend bookmark.
  4. Extract the contents between desired bookmarks.
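A rough plain-Java sketch of the numbering idea in steps 1 to 3, assuming the counter lives for the duration of one search: each match gets the name prefix plus a running number, so a particular start and end occurrence can be picked by name afterwards. The names OccurrenceNumbering and numberMatches are hypothetical, and the code works on a plain string, not on a document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OccurrenceNumbering {
    /** Maps prefix1, prefix2, ... to the start offset of each match of the phrase. */
    static Map<String, Integer> numberMatches(String text, String phrase, String prefix) {
        Pattern pattern = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        Map<String, Integer> positions = new LinkedHashMap<>();
        int counter = 1; // keeps counting across all matches of this one search
        while (matcher.find()) {
            positions.put(prefix + counter, matcher.start());
            counter++;
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "In response to A, see counterparty risk. "
                + "In response to B, see counterparty risk.";
        Map<String, Integer> starts = numberMatches(text, "in response to", "bmstart");
        Map<String, Integer> ends = numberMatches(text, "counterparty risk", "bmend");

        // Extract from the 2nd start occurrence through the 2nd end occurrence.
        int from = starts.get("bmstart2");
        int to = ends.get("bmend2") + "counterparty risk".length();
        System.out.println(text.substring(from, to));
    }
}
```

Because one counter is used per search, it is not reset between matches; a separate search with its own prefix simply starts again at 1.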

If you still face problems, please share your input and expected output documents here for our reference. We will then provide more information about your query.

Hi Tahir,

Thanks for your input. I am not clear about your solution. If I add an integer to the FindAndInsertBookmark class, won’t it be reset every time I instantiate the class with a bookmark name?

I am attaching a sample input file for your reference. My start bookmark should be at the phrase ‘In response to’, which occurs twice. The end bookmark should be at ‘counterparty risk’, which occurs four times. I need to extract the content between the 2nd instance of ‘In response to’ and the 4th instance of ‘counterparty risk’. The occurrence numbers will vary from case to case. Can you please help me with this?

Thanks.

Hi Renjith,

Thanks for your inquiry. Please use the following modified code example to insert the bookmarks for multiple instances of ‘In response to’ and ‘counterparty risk’ into document. Please check the attached image for bookmark’s detail. Once you have inserted the bookmarks in the documents, you can extract the contents between two bookmarks as you are already doing in your code. Hope this helps you. Please let us know if you have any more queries.

class FindAndInsertBookmarks implements IReplacingCallback
{
    String bookmark;
    int i = 1;
    FindAndInsertBookmarks(String bm)
    {
        bookmark = bm;
    }
    public int replacing(ReplacingArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();
        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run)currentNode, e.getMatchOffset());
        // This list collects all runs of the match so the bookmark can be placed on them.
        ArrayList runs = new ArrayList();
        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength))
        {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();
            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }
        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0 ))
        {
            splitRun((Run)currentNode, remainingLength);
            runs.add(currentNode);
        }
        Document doc = (Document)e.getMatchNode().getDocument();
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Compare string contents with equals; == only compares object references in Java.
        if ("bmend".equals(bookmark))
        {
            BookmarkStart bs = new BookmarkStart(doc, bookmark + i);
            BookmarkEnd be = new BookmarkEnd(doc, bookmark + i);
            Run run = (Run)runs.get(runs.size() - 1);
            run.getParentParagraph().insertAfter(bs, run);
            bs.getParentNode().insertAfter(be, bs);
            i++;
        }
        else
        {
            builder.moveTo((Run)runs.get(0));
            builder.startBookmark(bookmark + i);
            builder.endBookmark(bookmark + i);
            i++;
        }
        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.SKIP;
    }

    /**
     * Splits text of the specified run into two runs.
     * Inserts the new run just after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run)run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
// Load the document.
Document doc = new Document(MyDir + "09_Securities+Trading+Regulations.docx");

DocumentBuilder builder = new DocumentBuilder(doc);

// Find every occurrence of the start text and insert bookmarks bmstart1, bmstart2, ...
Pattern regex = Pattern.compile("in response to", Pattern.CASE_INSENSITIVE);

FindReplaceOptions options = new FindReplaceOptions();
options.setReplacingCallback(new FindAndInsertBookmarks("bmstart"));

doc.getRange().replace(regex, "", options);

// Find every occurrence of the end text and insert bookmarks bmend1, bmend2, ...
// The name must be "bmend" so that the callback places the bookmark after the matched text.
regex = Pattern.compile("counterparty risk", Pattern.CASE_INSENSITIVE);

options = new FindReplaceOptions();
options.setReplacingCallback(new FindAndInsertBookmarks("bmend"));

doc.getRange().replace(regex, "", options);

Hi Tahir,

I found an issue when adding a start bookmark to the attached document. My start bookmark is at the beginning of the document, at the word “INTRODUCTION”. I am doing a case-sensitive search.

//Find text and insert bookmark for starting text
Pattern regex = Pattern.compile("(?)"+"INTRODUCTION");

But the code above in this thread is not able to insert the bookmark. Can you please help me out?

Thanks.

Hi Tahir,
In addition to the above request, can we include headers and footers when adding bookmarks? That is, add a start bookmark at the header text of the 1st page and an end bookmark at the footer text of the second page.
Thanks.

Hi Renjith,

Thanks for your inquiry.
renjimat:
I found an issue when adding a start bookmark to the attached document. My start bookmark is at the beginning of the document, at the word “INTRODUCTION”. I am doing a case-sensitive search.
Please set the value of FindReplaceOptions.MatchCase property to true if you want case-sensitive comparison.

In your case, the word “INTRODUCTION” is not in capital letters. The font is formatted as small capital letters. You can get/set the value of this formatting using Font.SmallCaps.
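As a small illustration of why the case-sensitive search misses the heading: small caps is display formatting only, so the characters stored in the document keep their original case. The sketch below uses plain Java strings and assumes, purely for the example, that the stored text is "Introduction"; the class SmallCapsSearch and the method found are hypothetical names.

```java
import java.util.regex.Pattern;

final class SmallCapsSearch {
    /** Returns true when the phrase occurs in the stored text, optionally ignoring case. */
    static boolean found(String storedText, String phrase, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(Pattern.quote(phrase), flags).matcher(storedText).find();
    }

    public static void main(String[] args) {
        // Assumed stored text; it is merely rendered in small capitals.
        String stored = "Introduction";
        System.out.println(found(stored, "INTRODUCTION", false)); // case-sensitive search
        System.out.println(found(stored, "INTRODUCTION", true));  // case-insensitive search
    }
}
```

Searching with the case the text is actually stored in, or searching case-insensitively, avoids the mismatch.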
renjimat:
In addition to the above request, can we include headers and footers when adding bookmarks? That is, add a start bookmark at the header text of the 1st page and an end bookmark at the footer text of the second page.
You can use DocumentBuilder.MoveToHeaderFooter method to move the cursor to the beginning of a header or footer in the current section and insert the desired contents.

Please note that HeaderFooter is a section-level node and can only be a child of Section. There can only be one HeaderFooter of each HeaderFooterType in a Section. In your case, you need to insert a section break at the end of the first page and insert the contents in the headers/footers of the sections.

If you face any issue, please share your expected output document. We will then provide you more information about your query along with code.

Hi Tahir,
Please find attached the code I am using to split a Word document based on start and end text phrases. I will also use the text sequence number to place the bookmark on the correct occurrence when the text repeats in the document. You can find the artifacts in the attachment queries.zip.
Classes: AsposeWordChunkImpl.java, FindAndInsertBookmark.java
Input File: 09_Securities Trading Regulations.docx
Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It fails to place the bookmark on the end text and throws a NullPointerException while retrieving the end bookmark. I can find the end text when searching manually in the Word file.
Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It throws a StringIndexOutOfBoundsException when running the splitRun method in the FindAndInsertBookmark class.
Can you please check these issues and help me?
Thanks in advance.

Hi Renjith,

Thanks for your inquiry.

renjimat:

Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It fails to place the bookmark on the end text and throws a NullPointerException while retrieving the end bookmark. I can find the end text when searching manually in the Word file.

Please note that Aspose.Words mimics the same behavior as MS Word does. There are a few restrictions on Bookmark names e.g. the name must start with a word character (but not a digit) then any Unicode word character may follow up to an overall length of 40 characters. Microsoft Word does not support white spaces and punctuation of any kind in Bookmark’s name.

In this case, the end text (Distribute securities as part of securities underwriting) is longer than 40 characters and contains spaces, so it cannot be used as a bookmark name as-is. Pass a value of at most 40 characters to the FindAndInsertBookmarks constructor, or trim the bookmark name before inserting it.
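One possible way to derive a usable bookmark name from a search phrase, sketched in plain Java under the restrictions described above (no spaces or punctuation, no leading digit, at most 40 characters). The class BookmarkNames, the method toBookmarkName and the "bm_" prefix are assumptions made for this example, not part of Aspose.Words.

```java
final class BookmarkNames {
    private static final int MAX_LENGTH = 40;

    /** Turns arbitrary text into a name that respects the bookmark naming restrictions. */
    static String toBookmarkName(String raw) {
        // Replace every run of characters other than letters, digits and underscore.
        String name = raw.replaceAll("\\W+", "_");
        // Make sure the name starts with a letter.
        if (name.isEmpty() || !Character.isLetter(name.charAt(0))) {
            name = "bm_" + name;
        }
        // Trim to the maximum length.
        return name.length() > MAX_LENGTH ? name.substring(0, MAX_LENGTH) : name;
    }

    public static void main(String[] args) {
        System.out.println(toBookmarkName("counterparty risk"));
        System.out.println(toBookmarkName("Distribute securities as part of securities underwriting"));
    }
}
```

Note that trimming can make two long phrases collide on the same name, so a fixed prefix such as bmstart or bmend plus a counter is usually the simpler choice.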

renjimat:

Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It throws a StringIndexOutOfBoundsException when running the splitRun method in the FindAndInsertBookmark class.

Please use the following modified splitRun method to fix this issue.

private Run splitRun(Run run, int position) throws Exception
{
    // Clamp the split position so that substring never runs past the end of the run text.
    if (run.getText().length() < position)
        position = run.getText().length();
    Run afterRun = (Run)run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}

Hi Tahir,

I made the suggested changes but am still facing issues. Can you please help?

Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It fails to place the bookmark on the end text and throws a NullPointerException while retrieving the end bookmark. I can find the end text when searching manually in the Word file. Now I am using a simple bookmark name, ‘bmend1’ (the 1 is appended in the FindAndInsertBookmark class).

Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. After the suggested change, it now throws a NullPointerException in the extractContent method, on the line highlighted below.

// Move to the next node and extract it. If next node is null that means the rest of the content is found in a different section.
if (currNode.getNextSibling() == null && isExtracting)
{
    // Move to the next section.
    Section nextSection = (Section)currNode.getAncestor(NodeType.SECTION).getNextSibling();
    currNode = nextSection.getBody().getFirstChild();
}
else
{
    // Move to the next node in the body.
    currNode = currNode.getNextSibling();
}

I have attached my classes and artifacts in queries.zip.

Thanks.

Hi Renjith,

Thanks for your inquiry.

renjimat:

Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It fails to place the bookmark on the end text and throws a NullPointerException while retrieving the end bookmark. I can find the end text when searching manually in the Word file. Now I am using a simple bookmark name, ‘bmend1’ (the 1 is appended in the FindAndInsertBookmark class).

You are facing this issue because the endText is not found in the document. Following is the text in the document. See the attached image for detail. Please use the correct regular expressions in Pattern.compile method to get the desired output.

Distribute securities as part of[!NonBreakingSpace!][!FieldStart!] HYPERLINK "http://fs.wiki.goto-psi.com/SecuritiesUnderwriting" [!FieldSeparator!]**securities underwriting**[!FieldEnd!]
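Part of the mismatch is the non-breaking space before the hyperlink: a pattern that contains an ordinary space does not match it. The plain-Java sketch below builds a pattern that accepts either kind of space between words. It is only an illustration with made-up names (NbspTolerantPattern, forPhrase), and it does not deal with the hyperlink field code, which also sits between the words in this document.

```java
import java.util.regex.Pattern;

final class NbspTolerantPattern {
    /** Builds a pattern that allows ordinary whitespace or U+00A0 between the words. */
    static Pattern forPhrase(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                // By default \s does not cover the non-breaking space, so add it explicitly.
                regex.append("[\\s\\u00A0]+");
            }
            regex.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        String text = "Distribute securities as part of\u00A0securities underwriting";
        System.out.println(Pattern.compile("part of securities").matcher(text).find());
        System.out.println(forPhrase("part of securities").matcher(text).find());
    }
}
```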

renjimat:

Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. After the suggested change, it now throws a NullPointerException in the extractContent method, on the line highlighted below.

Please use the regular expressions according to your requirement to get the desired output. Please use “trading\.” in Pattern.compile method to fix this issue.

Pattern regex = Pattern.compile("trading\\.");
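For reference, a short plain-Java demonstration of what the escaping changes. An unescaped dot matches any character, so "trading." can match at places other than the literal "trading." and the bookmark may then land somewhere unexpected. The class name LiteralDot and the sample strings are made up for the example; Pattern.quote is a standard alternative when the whole phrase should be taken literally.

```java
import java.util.regex.Pattern;

final class LiteralDot {
    /** Returns true when the regular expression matches somewhere in the text. */
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unescaped dot: also matches "tradings", "trading " and so on.
        System.out.println(matches("trading.", "tradings"));
        // Escaped dot: matches only a literal full stop.
        System.out.println(matches("trading\\.", "tradings"));
        // Pattern.quote escapes the whole phrase in one go.
        System.out.println(matches(Pattern.quote("trading."), "stop trading."));
    }
}
```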