How to get the matched content when using replace

hlgao · January 6, 2021, 11:40am

How can I get the matched content when using replace with a regular expression?

Document WordDoc = new Document("D:/new.docx");
FindReplaceOptions options = new FindReplaceOptions();
WordDoc.getRange().replace(Pattern.compile("\\[start\\].*?\\[end\\]"), "", options);

With code like this, how can I get the content between [start] and [end]?
Thanks.

awais.hafeez · January 6, 2021, 3:52pm

@hlgao,

I think you can implement the following workflow to get the desired results:

1. Find the node that represents the starting keyword, i.e. [start].
2. Find the node that represents the ending keyword, i.e. [end].
3. Use the code from the following article to extract the content between the start and end nodes:

Extract Selected Content Between Nodes

For steps 1 and 2, please use code like this:

Document doc = new Document("C:\\Temp\\start.docx");

FindReplaceOptions options = new FindReplaceOptions();
options.setDirection(FindReplaceDirection.BACKWARD);
ReplaceHandler handler = new ReplaceHandler();
options.setReplacingCallback(handler);

doc.getRange().replace(Pattern.compile("\\[start\\]"), "", options);
Node startNode = handler.node;

doc.getRange().replace(Pattern.compile("\\[end\\]"), "", options);
Node endNode = handler.node;

static class ReplaceHandler implements IReplacingCallback {
    public Node node = null;

    public int replacing(ReplacingArgs e) throws Exception {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();

        // The first (and possibly the only) run can contain text before the match;
        // in that case the run must be split.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run) currentNode, e.getMatchOffset());

        ArrayList<Node> runs = new ArrayList<>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength)) {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do {
                currentNode = currentNode.getNextSibling();
            } while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0)) {
            splitRun((Run) currentNode, remainingLength);
            runs.add(currentNode);
        }

        node = (Run) runs.get(runs.size() - 1);

        // Tell the replace engine to stop after this match; we have already done everything we wanted.
        return ReplaceAction.STOP;
    }

    /**
     * Splits text of the specified run into two runs. Inserts the new run just
     * after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception {
        Run afterRun = (Run) run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
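
The splitting done by splitRun above can be pictured on plain strings. The following is only a minimal sketch, assuming a single run whose text contains the whole match; `RunSplitSketch`, `splitAt` and `isolate` are hypothetical names for illustration and are not part of Aspose.Words:

```java
// Hypothetical illustration on plain strings; no Aspose.Words types are used.
class RunSplitSketch {

    // Mirrors the idea of splitRun: cut the text at 'position' into two pieces.
    static String[] splitAt(String text, int position) {
        return new String[] { text.substring(0, position), text.substring(position) };
    }

    // Isolates the match as its own piece: {text before, match, text after}.
    static String[] isolate(String runText, int matchOffset, int matchLength) {
        String[] head = splitAt(runText, matchOffset);   // split off text before the match
        String[] tail = splitAt(head[1], matchLength);   // split off text after the match
        return new String[] { head[0], tail[0], tail[1] };
    }

    public static void main(String[] args) {
        String[] parts = isolate("before [start] after", 7, 7);
        System.out.println(String.join("|", parts)); // prints before |[start]| after
    }
}
```

In the handler, the same two cuts are made on Run nodes so that the matched keyword ends up in a run of its own.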

Please let me know if I can be of any further assistance.

hlgao · January 7, 2021, 9:20am

Thank you for your patience.
This gets the desired result, but the code feels rather complex.
It would be simpler if the matched text could be read directly, as with a regular expression applied to plain text.

awais.hafeez · January 7, 2021, 4:55pm

@hlgao,

If you want to get a text representation of the content between the [start] and [end] tags, then please use the following Java code:

Document doc = new Document("C:\\Temp\\start.docx");

FindReplaceOptions options = new FindReplaceOptions();
options.setDirection(FindReplaceDirection.BACKWARD);
ReplaceHandler handler = new ReplaceHandler();
options.setReplacingCallback(handler);

doc.getRange().replace(Pattern.compile("\\[start\\](.*?)\\[end\\]"), "", options);

static class ReplaceHandler implements IReplacingCallback {

    public int replacing(ReplacingArgs e) throws Exception {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();

        // The first (and possibly the only) run can contain text before the match;
        // in that case the run must be split.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run) currentNode, e.getMatchOffset());

        ArrayList<Node> runs = new ArrayList<>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while ((remainingLength > 0) && (currentNode != null) && (currentNode.getText().length() <= remainingLength)) {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do {
                currentNode = currentNode.getNextSibling();
            } while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0)) {
            splitRun((Run) currentNode, remainingLength);
            runs.add(currentNode);
        }

        System.out.println(e.getMatch().group(1).trim());

        // Tell the replace engine to stop after this match; we have already done everything we wanted.
        return ReplaceAction.STOP;
    }

    /**
     * Splits text of the specified run into two runs. Inserts the new run just
     * after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception {
        Run afterRun = (Run) run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
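
The text printed by e.getMatch().group(1) is the first capturing group of the pattern. As a minimal sketch of that idea outside Aspose.Words, assuming the document text is already available as a plain Java string, the same pattern can be applied with java.util.regex directly. `TagExtractor` and `extractBetween` are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words: it works on plain strings only.
class TagExtractor {

    // Same pattern as in the reply above; group 1 is the text between the tags.
    private static final Pattern TAGGED = Pattern.compile("\\[start\\](.*?)\\[end\\]");

    // Returns the trimmed content of every [start]...[end] pair found in the text.
    static List<String> extractBetween(String text) {
        List<String> results = new ArrayList<>();
        Matcher m = TAGGED.matcher(text);
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "Intro [start] first [end] middle [start]second[end]";
        System.out.println(extractBetween(text)); // prints [first, second]
    }
}
```

The lazy quantifier `.*?` keeps each match as short as possible, so two tagged sections in the same text are returned separately rather than merged into one.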

hlgao · January 8, 2021, 6:01am

@awais.hafeez
Thank you very much.
But if the end marker is in a different paragraph, the match stops at the end of the paragraph that contains the start marker.

awais.hafeez · January 8, 2021, 10:22am

@hlgao,

Please ZIP and upload the input Word document that shows this problem here for testing. We will then investigate the issue further on our end and provide you with more information.

hlgao · January 9, 2021, 10:28am

@awais.hafeez
Please see this document. The end marker is in the next paragraph.
01.zip (12.0 KB)
thanks

awais.hafeez · January 9, 2021, 6:38pm

@hlgao,

Instead of writing the matched string to the Java console, please try writing it to a separate text or .docx file:

static class ReplaceHandler implements IReplacingCallback {

    public int replacing(ReplacingArgs e) throws Exception {

        DocumentBuilder builder = new DocumentBuilder();
        builder.write(e.getMatch().group(1));
        builder.getDocument().save("C:\\Temp\\01\\21.1.docx");

        // System.out.println(e.getMatch().group(1).trim());

        return ReplaceAction.STOP;
    }
}
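
One thing that may be worth checking when [end] sits in a later paragraph: in java.util.regex, `.` does not match line terminators (including `\r`) unless Pattern.DOTALL is set. The sketch below demonstrates this on a plain string only. It assumes paragraphs appear as `\r` in the extracted text, and it does not show how the Aspose.Words replace engine itself treats paragraph breaks. `CrossParagraphDemo` and `firstMatch` are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo on plain strings; no Aspose.Words types are used.
class CrossParagraphDemo {

    // Returns group 1 of the first match, or null when the pattern does not match.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumption: '\r' stands in for a paragraph break between the two markers.
        String text = "[start]line one\rline two[end]";

        Pattern defaultMode = Pattern.compile("\\[start\\](.*?)\\[end\\]");
        Pattern dotAllMode = Pattern.compile("\\[start\\](.*?)\\[end\\]", Pattern.DOTALL);

        // Without DOTALL, '.' cannot cross the '\r', so there is no match.
        System.out.println(firstMatch(defaultMode, text) == null); // prints true

        // With DOTALL, '.' also matches '\r', so the content spans both lines.
        System.out.println(firstMatch(dotAllMode, text) != null);  // prints true
    }
}
```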