Splitting document based on start and end text

renjimat · October 21, 2016, 8:12am

Thank you very much, Tahir. The code works for me. However, it's not including the end text in the new document. What should I do to include it?

tahir.manzoor · October 24, 2016, 1:08am

Hi Renjith,

Thanks for your inquiry. In this case, you need to insert the second bookmark after the “end text”. You can use the BookmarkStart and BookmarkEnd classes to create a bookmark, and the CompositeNode.insertAfter method to insert it after the “end text” Run node.

Please let us know if you have any more queries.

renjimat · October 24, 2016, 7:58am

Hi Tahir,

Are you referring to the method below? It's already using the insertAfter method. Please guide me if I am wrong.

/**
 * Splits text of the specified run into two runs.
 * Inserts the new run just after the specified run.
 */
private Run splitRun(Run run, int position) throws Exception
{
    Run afterRun = (Run) run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}

Thanks.

tahir.manzoor · October 24, 2016, 10:36am

Hi Renjith,
Thanks for your inquiry. Please try the following modified code snippet. Hope this helps you.

public int replacing(ReplacingArgs e) throws Exception {
    // This is a Run node that contains either the beginning or the complete match.
    Node currentNode = e.getMatchNode();

    // The first (and may be the only) run can contain text before the match,
    // in this case it is necessary to split the run.
    if (e.getMatchOffset() > 0)
        currentNode = splitRun((Run) currentNode, e.getMatchOffset());

    ArrayList runs = new ArrayList();

    // Find all runs that contain parts of the match string.
    int remainingLength = e.getMatch().group().length();
    while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength)) {
        runs.add(currentNode);
        remainingLength = remainingLength - currentNode.getText().length();

        // Select the next Run node.
        // Have to loop because there could be other nodes such as BookmarkStart etc.
        do {
            currentNode = currentNode.getNextSibling();
        }
        while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
    }

    // Split the last run that contains the match if there is any text left.
    if ((currentNode != null) && (remainingLength > 0)) {
        splitRun((Run) currentNode, remainingLength);
        runs.add(currentNode);
    }

    Document doc = (Document) e.getMatchNode().getDocument();
    DocumentBuilder builder = new DocumentBuilder(doc);

    // Compare string contents with equals(); == compares object references.
    if ("bmend".equals(bookmark)) {
        BookmarkStart bs = new BookmarkStart(doc, bookmark);
        BookmarkEnd be = new BookmarkEnd(doc, bookmark);
        Run run = (Run) runs.get(runs.size() - 1);

        run.getParentParagraph().insertAfter(bs, run);

        bs.getParentNode().insertAfter(be, bs);
    } else {

        builder.moveTo((Run) runs.get(0));
        builder.startBookmark(bookmark);
        builder.endBookmark(bookmark);
    }

    // Signal to the replace engine to do nothing because we have already done everything we wanted.
    return ReplaceAction.SKIP;
}
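
The end-bookmark branch depends on recognising the name "bmend". As a side note, here is a minimal plain-Java sketch (no Aspose.Words involved) of why comparing string contents with equals is safer than comparing references with ==. The class and method names are hypothetical and used only for illustration.

```java
class BookmarkKind {
    /** Returns true when the given bookmark name is the end-bookmark name. */
    static boolean isEndBookmark(String bookmark) {
        // Calling equals on the literal also makes the check null-safe.
        return "bmend".equals(bookmark);
    }

    public static void main(String[] args) {
        // A name assembled at run time is a different object than the literal "bmend".
        String built = new StringBuilder("bm").append("end").toString();
        System.out.println(built == "bmend");      // reference comparison
        System.out.println(isEndBookmark(built));  // content comparison
    }
}
```

A reference comparison only succeeds when both sides happen to be the same interned object, which is not guaranteed for names read from arguments or files.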

renjimat · October 24, 2016, 10:02pm

Thank you for your quick help. It's working fine. I’ll let you know if I need any more help.

tahir.manzoor · October 25, 2016, 5:56am

Hi Renjith,

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

renjimat · November 24, 2016, 1:23am

Hi Tahir,

I need an addition to the above logic. Suppose we have multiple instances of the starting and ending texts. How do we insert a bookmark on, say, the 2nd or 3rd repetition of the text if we provide the repetition number?

Thanks in advance!

tahir.manzoor · November 25, 2016, 1:07am

Hi Renjith,

Thanks for your inquiry. In this case, we suggest the following solution:

1. Add an integer member to the FindAndInsertBookmark class.
2. Increment its value by 1 in the IReplacingCallback.replacing method.
3. Add bookmarks to the document with the names bmstart1, bmstart2 and so on. Do the same for the bmend bookmark.
4. Extract the contents between the desired bookmarks.

If you still face problem, please share your input and expected output documents here for our reference. We will then provide you more information about your query.
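
The numbering idea in these steps can be sketched on plain text, without Aspose.Words. This is only an illustration under assumptions: the class BookmarkNumbering, its method and the sample sentence are hypothetical, and character offsets stand in for the bookmarks that the callback would insert into the document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BookmarkNumbering {
    /**
     * Numbers every case-insensitive match of phrase as prefix1, prefix2, ...
     * and records the offset where a bookmark would go: the start of the match,
     * or its end when atEnd is true (so that the end text is included).
     */
    static Map<String, Integer> numberMatches(String text, String phrase, String prefix, boolean atEnd) {
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE).matcher(text);
        int counter = 1; // one counter per search, incremented on every match
        while (m.find()) {
            offsets.put(prefix + counter, atEnd ? m.end() : m.start());
            counter++;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "In response to A, counterparty risk. In response to B, counterparty risk and counterparty risk.";
        Map<String, Integer> starts = numberMatches(text, "in response to", "bmstart", false);
        Map<String, Integer> ends = numberMatches(text, "counterparty risk", "bmend", true);
        // Extract from the 2nd start occurrence up to and including the 3rd end occurrence.
        System.out.println(text.substring(starts.get("bmstart2"), ends.get("bmend3")));
    }
}
```

Because the counter lives in one object that sees every match of a single search, it is not reset between matches; the caller then picks the numbered name for the repetition it needs.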

renjimat · November 25, 2016, 6:26am

Hi Tahir,

Thanks for your input. Your solution is not clear to me. If I add an integer to the FindAndInsertBookmark class, won't it be reset every time I initialize the class with a bookmark?

I am attaching a sample input file for your reference. My bookmark start should be the phrase ‘In response to’, which occurs twice. The bookmark end should be ‘counterparty risk’, which occurs four times. I need to extract the content between the 2nd instance of ‘In response to’ and the 4th instance of ‘counterparty risk’. The instance numbers will vary from case to case. Can you please help me with this?

Thanks.

tahir.manzoor · November 28, 2016, 4:37am

Hi Renjith,

Thanks for your inquiry. Please use the following modified code example to insert the bookmarks for multiple instances of ‘In response to’ and ‘counterparty risk’ into document. Please check the attached image for bookmark’s detail. Once you have inserted the bookmarks in the documents, you can extract the contents between two bookmarks as you are already doing in your code. Hope this helps you. Please let us know if you have any more queries.

class FindAndInsertBookmarks implements IReplacingCallback
{
    String bookmark;
    int i = 1;
    FindAndInsertBookmarks(String bm)
    {
        bookmark = bm;
    }
    public int replacing(ReplacingArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();
        // The first (and maybe the only) run can contain text before the match;
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run) currentNode, e.getMatchOffset());
        // This list stores all runs that make up the match.
        ArrayList runs = new ArrayList();
        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength))
        {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();
            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }
        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            splitRun((Run) currentNode, remainingLength);
            runs.add(currentNode);
        }
        Document doc = (Document) e.getMatchNode().getDocument();
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Compare string contents with equals(); == compares object references.
        if ("bmend".equals(bookmark))
        {
            // End bookmark: insert it after the last run of the match so the end text is included.
            BookmarkStart bs = new BookmarkStart(doc, bookmark + i);
            BookmarkEnd be = new BookmarkEnd(doc, bookmark + i);
            Run run = (Run) runs.get(runs.size() - 1);
            run.getParentParagraph().insertAfter(bs, run);
            bs.getParentNode().insertAfter(be, bs);
            i++;
        }
        else
        {
            // Start bookmark: insert it before the first run of the match.
            builder.moveTo((Run) runs.get(0));
            builder.startBookmark(bookmark + i);
            builder.endBookmark(bookmark + i);
            i++;
        }
        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.SKIP;
    }

    /**
     * Splits text of the specified run into two runs.
     * Inserts the new run just after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run)run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}

//Load in the document
Document doc = new Document(MyDir + "09_Securities+Trading+Regulations.docx");

DocumentBuilder builder = new DocumentBuilder(doc);

// Find the start text and insert a numbered bookmark (bmstart1, bmstart2, ...) at each match
Pattern regex = Pattern.compile("in response to", Pattern.CASE_INSENSITIVE);

FindReplaceOptions options = new FindReplaceOptions();
options.setReplacingCallback(new FindAndInsertBookmarks("bmstart"));

doc.getRange().replace(regex, "", options);

// Find the end text and insert a numbered bookmark (bmend1, bmend2, ...) after each match
regex = Pattern.compile("counterparty risk", Pattern.CASE_INSENSITIVE);

options = new FindReplaceOptions();
options.setReplacingCallback(new FindAndInsertBookmarks("bmend"));

doc.getRange().replace(regex, "", options);

renjimat · December 6, 2016, 3:50am

Hi Tahir,

I found an issue when adding a bookmark start to the attached document. My start bookmark is at the beginning of the document. It is the word “INTRODUCTION” in the attached document. I am doing a case-sensitive search.

//Find text and insert bookmark for starting text
Pattern regex = Pattern.compile("(?)"+"INTRODUCTION");

But the code above in the thread is not able to insert the bookmark. Can you please help me out?

Thanks.

renjimat · December 6, 2016, 4:04am

Hi Tahir,
In addition to the above request, can we include headers and footers while adding a bookmark? That is, add a start bookmark at the header text of the 1st page and an end bookmark at the footer text of the second page.
Thanks.

tahir.manzoor · December 7, 2016, 1:21am

Hi Renjith,

Thanks for your inquiry.

renjimat:

I found an issue when adding a bookmark start to the attached document. My start bookmark is at the beginning of the document. It is the word “INTRODUCTION” in the attached document. I am doing a case-sensitive search.

Please set the value of the FindReplaceOptions.MatchCase property to true if you want a case-sensitive comparison.

In your case, the word “INTRODUCTION” is not in capital letters. The font is formatted as small capital letters. You can get/set the value of this formatting using Font.SmallCaps.
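
As a minimal plain-Java sketch of this effect (no Aspose.Words involved): text formatted as small caps looks like capitals on screen but keeps its original casing in the stored text. The stored value "Introduction" is an assumption for illustration only, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

class SmallCapsSearch {
    /** Searches storedText for phrase as a literal, with or without case sensitivity. */
    static boolean found(String storedText, String phrase, boolean matchCase) {
        int flags = matchCase ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(Pattern.quote(phrase), flags).matcher(storedText).find();
    }

    public static void main(String[] args) {
        String stored = "Introduction"; // assumed stored text; displayed as small caps
        System.out.println(found(stored, "INTRODUCTION", true));
        System.out.println(found(stored, "INTRODUCTION", false));
    }
}
```

A case-sensitive search for the capitalised word misses the stored text, while a case-insensitive search finds it.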

renjimat:

In addition to the above request, can we include headers and footers while adding a bookmark? That is, add a start bookmark at the header text of the 1st page and an end bookmark at the footer text of the second page.

You can use the DocumentBuilder.MoveToHeaderFooter method to move the cursor to the beginning of a header or footer in the current section and insert the desired contents.

Please note that HeaderFooter is a section-level node and can only be a child of Section. There can only be one HeaderFooter of each HeaderFooterType in a Section. In your case, you need to insert a section break at the end of the first page and insert the contents in the headers/footers of the sections.

If you face any issue, please share your expected output document. We will then provide you more information about your query along with code.

renjimat · December 7, 2016, 2:56am

Hi Tahir,
Please find attached the code I am using to split a Word document based on start and end text phrases. I will also use the text sequence number to add the bookmark at the correct occurrence in case the text is repeated in the document. You can find the artifacts in the attachment queries.zip.
Classes: AsposeWordChunkImpl.java, FindAndInsertBookmark.java
Input File: 09_Securities Trading Regulations.docx
Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It's failing to place the bookmark on the end text and throwing a null pointer exception while retrieving the end bookmark. I can find the end text when searching manually in the Word file.
Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It's throwing a StringIndexOutOfBoundsException when running the splitRun method in the FindAndInsertBookmark class.
Can you please check these issues and help me?
Thanks in advance.

tahir.manzoor · December 8, 2016, 1:26am

Hi Renjith,

Thanks for your inquiry.

renjimat:

Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It's failing to place the bookmark on the end text and throwing a null pointer exception while retrieving the end bookmark. I can find the end text when searching manually in the Word file.

Please note that Aspose.Words mimics the same behavior as MS Word does. There are a few restrictions on Bookmark names e.g. the name must start with a word character (but not a digit) then any Unicode word character may follow up to an overall length of 40 characters. Microsoft Word does not support white spaces and punctuation of any kind in Bookmark’s name.

In this case, the end text (Distribute securities as part of securities underwriting) is longer than 40 characters, so it cannot be used as a bookmark name. You need to pass a text value to the FindAndInsertBookmarks constructor that is no longer than 40 characters, or you may trim the bookmark’s name before inserting it.
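
A minimal plain-Java sketch of trimming a search phrase into a name that follows the restrictions described above (starts with a letter, only word characters, at most 40 characters). The class BookmarkNames, the "bm_" prefix and the underscore replacement are assumptions made for illustration, not part of the Aspose.Words API.

```java
class BookmarkNames {
    static final int MAX_LENGTH = 40;

    /** Derives a bookmark-safe name from arbitrary text (illustrative helper). */
    static String fromText(String text) {
        // Replace each run of spaces, punctuation and other non-word characters with one underscore.
        String name = text.trim().replaceAll("[^\\p{L}\\p{N}_]+", "_");
        // The name must start with a letter, so prefix anything else (digit, underscore, empty).
        if (name.isEmpty() || !Character.isLetter(name.charAt(0))) {
            name = "bm_" + name;
        }
        // Cut the name down to the maximum length.
        return name.length() > MAX_LENGTH ? name.substring(0, MAX_LENGTH) : name;
    }

    public static void main(String[] args) {
        System.out.println(fromText("Distribute securities as part of securities underwriting"));
        System.out.println(fromText("counterparty risk"));
    }
}
```

If a counter is appended to the name afterwards (bmend1, bmend2, ...), the limit should leave room for the suffix; a fixed short name such as bmend avoids the problem altogether.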

renjimat:

Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It's throwing a StringIndexOutOfBoundsException when running the splitRun method in the FindAndInsertBookmark class.

Please use the following modified splitRun method to fix this issue.

private Run splitRun(Run run, int position) throws Exception
{
    // Clamp the split position so substring() cannot run past the end of the run text.
    if (run.getText().length() < position)
        position = run.getText().length();
    Run afterRun = (Run) run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}
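
The clamping step can be tried on plain strings. The sketch below is an illustration only: RunSplit is a hypothetical stand-in that splits a String instead of an Aspose.Words Run, and it additionally guards against negative positions.

```java
class RunSplit {
    /** Splits text at position, clamping the position into the range [0, text.length()]. */
    static String[] split(String text, int position) {
        int p = Math.max(0, Math.min(position, text.length()));
        return new String[] { text.substring(0, p), text.substring(p) };
    }

    public static void main(String[] args) {
        // Without clamping, "trading.".substring(0, 20) throws StringIndexOutOfBoundsException.
        String[] parts = split("trading.", 20);
        System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
    }
}
```

When the requested position is past the end of the text, the first part keeps the whole text and the second part is empty.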

renjimat · December 8, 2016, 2:45am

Hi Tahir,

I made the suggested changes, but I am still facing issues. Can you please help?

Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It's failing to place the bookmark on the end text and throwing a null pointer exception while retrieving the end bookmark. I can find the end text when searching manually in the Word file. Now I am using a simple bookmark name, ‘bmend1’ (the 1 is appended in the FindAndInsertBookmark class).

Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. After the recently suggested change, it's now throwing a null pointer exception in the extractContent method, in the code below.

// Move to the next node and extract it. If next node is null that means the rest of the content is found in a different section.
if (currNode.getNextSibling() == null && isExtracting)
{
    // Move to the next section.
    Section nextSection = (Section)currNode.getAncestor(NodeType.SECTION).getNextSibling();
    currNode = nextSection.getBody().getFirstChild();
}
else
{
    // Move to the next node in the body.
    currNode = currNode.getNextSibling();
}

Attached my classes and artifacts in the queries.zip

Thanks.

tahir.manzoor · December 9, 2016, 3:36am

Hi Renjith,

Thanks for your inquiry.

renjimat:

Case 1: I am using the arguments in ‘param1.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. It's failing to place the bookmark on the end text and throwing a null pointer exception while retrieving the end bookmark. I can find the end text when searching manually in the Word file. Now I am using a simple bookmark name, ‘bmend1’ (the 1 is appended in the FindAndInsertBookmark class).

You are facing this issue because the endText is not found in the document. Following is the text in the document. See the attached image for detail. Please use the correct regular expressions in Pattern.compile method to get the desired output.

Distribute securities as part of[!NonBreakingSpace!][!FieldStart!] HYPERLINK "http://fs.wiki.goto-psi.com/SecuritiesUnderwriting" [!FieldSeparator!]**securities underwriting**[!FieldEnd!]
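
To illustrate why a plain search string misses this text, here is a minimal plain-Java sketch that turns such raw text into the text displayed on screen by dropping field codes and replacing non-breaking spaces. It assumes the field start, separator and end markers are the control characters U+0013, U+0014 and U+0015; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class DisplayedText {
    static final char FIELD_START = '\u0013';     // assumed field start marker
    static final char FIELD_SEPARATOR = '\u0014'; // assumed separator between field code and result
    static final char FIELD_END = '\u0015';       // assumed field end marker
    static final char NBSP = '\u00A0';

    /** Drops field codes, keeps field results, and turns non-breaking spaces into spaces. */
    static String of(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while inside its code part, false inside its result.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                inCode.push(true);
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty()) { inCode.pop(); inCode.push(false); }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) inCode.pop();
            } else if (!inCode.contains(true)) {
                out.append(c == NBSP ? ' ' : c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Distribute securities as part of\u00A0\u0013 HYPERLINK "
                + "\"http://fs.wiki.goto-psi.com/SecuritiesUnderwriting\" \u0014securities underwriting\u0015";
        System.out.println(of(raw));
    }
}
```

The search phrase matches the cleaned text, but not the raw text, where the hyperlink field code and the non-breaking space sit in the middle of the phrase.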

renjimat:

Case 2: I am using the arguments in ‘param2.txt’ to run the AsposeWordChunkImpl.chunkWordDocument method. After the recently suggested change, it's now throwing a null pointer exception in the extractContent method, in the code below.

Please use the regular expressions according to your requirement to get the desired output. Please use “trading\.” in Pattern.compile method to fix this issue.

Pattern regex = Pattern.compile("trading\\.");
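
When the search phrase comes in as plain text, escaping it by hand is easy to forget. A minimal plain-Java sketch, assuming the phrase should always be matched literally; the class and method names are hypothetical, while Pattern.quote is the standard java.util.regex method.

```java
import java.util.regex.Pattern;

class LiteralSearch {
    /** Treats phrase as a regular expression (metacharacters are active). */
    static boolean containsRegex(String text, String phrase) {
        return Pattern.compile(phrase).matcher(text).find();
    }

    /** Treats phrase as literal text by quoting it first. */
    static boolean containsLiteral(String text, String phrase) {
        return Pattern.compile(Pattern.quote(phrase)).matcher(text).find();
    }

    public static void main(String[] args) {
        // An unescaped dot matches any character, so "trading." also matches "trading r".
        System.out.println(containsRegex("trading rules", "trading."));
        System.out.println(containsLiteral("trading rules", "trading."));
        System.out.println(containsLiteral("securities trading.", "trading."));
    }
}
```

Quoting the whole phrase gives the same result as escaping each metacharacter, and it also covers parentheses, brackets and other special characters.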

renjimat · December 15, 2016, 6:47am

Hi Tahir,
Thanks for your explanation. I now understand why Aspose is not able to find the text.
Let me share a little background about our application. We have a content analyzer engine, and we provide the content of the whole document as input to it. The engine logically divides the document based on the content and gives us the start and end boundaries of each logical section as plain text. We plan to use the Aspose API to physically divide the parent document into small child documents by adding bookmarks on the boundaries and extracting the content between the bookmarks.
The boundaries provided by the engine are plain text, and we do not know whether any phrase contains a hyperlink. If we knew that the words contained hyperlinks, we could use a regex to exclude that part while matching. So we are facing difficulties when a boundary contains hyperlinks. There may be cases where a boundary contains multiple hyperlinks.
I would like to know if there is any way in Aspose to find text as it appears on the screen. For example, in the attached document above, the text is displayed in Word as “Distribute securities as part of securities underwriting”, but it is actually “Distribute securities as part of { HYPERLINK "http://fs.wiki.goto-psi.com/SecuritiesUnderwriting" }”. Is there a way to find the text by just providing “Distribute securities as part of securities underwriting” as the search string, as we can in the Word editor?
I would appreciate a quick reply. Thanks.

tahir.manzoor · December 16, 2016, 2:18am

Hi Renjith,

Thanks for your inquiry. In your case, we suggest that you iterate through the paragraphs of the document and check whether a paragraph’s text contains the endText. If it does, insert the bookmark at the start or end of the paragraph according to your requirements. Please check the following code example. Hope this helps you.

Document doc = new Document(MyDir + "09_Securities+Trading+Regulations.docx");
doc.getRange().replace(ControlChar.NON_BREAKING_SPACE, " ", new FindReplaceOptions());
String text = "Distribute securities as part of securities underwriting";
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().contains(text))
    {
        BookmarkStart bs = new BookmarkStart(doc, "bookmark");
        BookmarkEnd be = new BookmarkEnd(doc, "bookmark");
        Run run = paragraph.getRuns().get(paragraph.getRuns().getCount() - 1);
        run.getParentParagraph().insertAfter(bs, run);
        bs.getParentNode().insertAfter(be, bs);
        break;
    }
}
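
The paragraph scan itself can be sketched on plain strings, independent of Aspose.Words. The helper below is hypothetical: it normalises non-breaking spaces and repeated whitespace on both sides before comparing, which is a slightly broader normalisation than the replace call above.

```java
import java.util.Arrays;
import java.util.List;

class ParagraphSearch {
    /** Replaces non-breaking spaces and collapses whitespace runs into single spaces. */
    static String normalize(String s) {
        // \s does not match U+00A0 by default in Java, so it is replaced explicitly first.
        return s.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
    }

    /** Returns the index of the first paragraph containing phrase, or -1 if none does. */
    static int indexOfFirstContaining(List<String> paragraphs, String phrase) {
        String needle = normalize(phrase);
        for (int i = 0; i < paragraphs.size(); i++) {
            if (normalize(paragraphs.get(i)).contains(needle)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Introduction",
                "Distribute securities as part of\u00A0securities  underwriting.");
        System.out.println(indexOfFirstContaining(paragraphs,
                "Distribute securities as part of securities underwriting"));
    }
}
```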

alexey.noskov · March 24, 2023, 5:28pm

A post was split to a new topic: The process freezes when call doc.ProcessParagraph()