Getting text in a Word document that matches a regular expression

ks.pavan · July 23, 2014, 6:38am

Dear Team,

We are using Aspose.Words in our applications and we have a requirement.

We would like to search a Word document with an input regular expression and retrieve the nodes that match it. An example is below.

|1.1|Sample|1.1|

|1.2|Text|1.2|

If I search with the regular expression |1.1|<anything>|1.1|, I should get the nodes that match this pattern. Here <anything> means (.*) in regular-expression syntax.

Later, I would like to replace the text |1.1|Sample|1.1| with "Sample", which is the part in the middle.
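The pattern described above can be tried out with plain java.util.regex before involving the document at all. The following is a minimal sketch, not part of the original post: the class name MarkerRegexSketch and the methods forMarker and unwrap are hypothetical. Note that | is the alternation operator and . matches any character in a Java regex, so both have to be escaped (here via Pattern.quote) for the markers to be matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class MarkerRegexSketch {

    // Matches |N.N|text|N.N| where the closing marker repeats the opening one (\1).
    // Group 1 is the marker number, group 2 is the text in the middle.
    static final Pattern ANY_MARKER =
            Pattern.compile("\\|(\\d+\\.\\d+)\\|(.*?)\\|\\1\\|");

    // Builds a pattern for one specific marker, e.g. "1.1"; group 1 is the middle text.
    static Pattern forMarker(String marker) {
        String quoted = Pattern.quote("|" + marker + "|");
        return Pattern.compile(quoted + "(.*?)" + quoted);
    }

    // Replaces every |N.N|text|N.N| occurrence with just the middle text.
    static String unwrap(String text) {
        return ANY_MARKER.matcher(text).replaceAll("$2");
    }

    public static void main(String[] args) {
        Matcher m = forMarker("1.1").matcher("|1.1|Sample|1.1|");
        if (m.find()) {
            System.out.println(m.group(1)); // prints "Sample"
        }
    }
}
```

The same capturing group should be readable inside a replacing callback through the Matcher that e.getMatch() returns, for example e.getMatch().group(1), assuming the pattern passed to Aspose.Words contains that group.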

Could you please let us know how this can be achieved using Aspose.Words for Java?

Thanks & Regards

Pavan

tahir.manzoor · July 24, 2014, 2:12am

Hi Pavan,

Thanks for your inquiry.

You can achieve this by implementing the IReplacingCallback interface. Please use the approach described at the following documentation link to find and replace the text, and see the code snippet below.

http://www.aspose.com/docs/display/wordsnet/How+to+Find+and+Highlight+Text

Please also read the following documentation for reference.
http://www.aspose.com/docs/display/wordsnet/Find+and+Replace+Overview

Hope this helps you. Please let us know if you have any more queries.

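// Imports needed by this snippet (Aspose.Words for Java plus JDK classes).
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

// Runs the regular expression over the whole document; the callback performs the replacement.
public void FindandReplace(Document doc, String regexString, String newText) throws Exception
{
    ReplaceEvaluatorTest obj = new ReplaceEvaluatorTest();
    obj.newText = newText;

    Pattern regex = Pattern.compile(regexString, Pattern.CASE_INSENSITIVE);
    doc.getRange().replace(regex, obj, false);
}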

class ReplaceEvaluatorTest implements IReplacingCallback
{
    public String newText;

    @Override
    public int replacing(ReplacingArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();

        // The first (and maybe the only) run can contain text before the match;
        // in that case the run has to be split.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run) currentNode, e.getMatchOffset());

        // This list stores all runs that make up the match.
        ArrayList<Run> runs = new ArrayList<Run>();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while ((remainingLength > 0) &&
               (currentNode != null) &&
               (currentNode.getText().length() <= remainingLength))
        {
            runs.add((Run) currentNode);
            remainingLength = remainingLength - currentNode.getText().length();

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            splitRun((Run) currentNode, remainingLength);
            runs.add((Run) currentNode);
        }

        // Take the document from the match node: currentNode may be null after the loop.
        DocumentBuilder builder = new DocumentBuilder((Document) e.getMatchNode().getDocument());
        builder.moveTo(runs.get(0));
        // Insert the replacement as plain text at the position of the first matched run.
        builder.write(newText);

        // Remove the runs that held the matched text.
        for (Run run : runs)
        {
            run.remove();
        }

        // Tell the replace engine to do nothing more; the replacement is already done.
        return ReplaceAction.SKIP;
    }

    /**
     * Splits the text of the specified run into two runs.
     * Inserts the new run just after the specified run and returns it.
     */
    private Run splitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run) run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
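
The run bookkeeping in the callback above (split the first run at the match offset, collect whole runs while they fit into the remaining length, then split the last run) can be followed in isolation with plain strings standing in for Run nodes. The following is a minimal sketch under that assumption. The class RunSplitSketch and the method carve are hypothetical names for illustration and are not part of Aspose.Words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: each String plays the role of one Run's text.
final class RunSplitSketch {

    // Holds the pieces that belong to the match and the pieces left in the paragraph.
    static final class Result {
        final List<String> matched = new ArrayList<String>();
        final List<String> remaining = new ArrayList<String>();
    }

    // Carves a match of the given length out of the runs, starting at
    // character 'offset' inside the run at index 'startRun'.
    static Result carve(List<String> runs, int startRun, int offset, int length) {
        Result result = new Result();
        for (int k = 0; k < startRun; k++) {
            result.remaining.add(runs.get(k));
        }

        int i = startRun;
        String current = runs.get(i);

        // Text before the match stays behind (the first splitRun call).
        if (offset > 0) {
            result.remaining.add(current.substring(0, offset));
            current = current.substring(offset);
        }

        // Collect runs that lie completely inside the match.
        int left = length;
        while (left > 0 && current != null && current.length() <= left) {
            result.matched.add(current);
            left -= current.length();
            i++;
            current = (i < runs.size()) ? runs.get(i) : null;
        }

        // The last run may hold the tail of the match plus trailing text (the second splitRun call).
        if (current != null && left > 0) {
            result.matched.add(current.substring(0, left));
            current = current.substring(left);
        }

        if (current != null) {
            result.remaining.add(current);
        }
        for (int k = i + 1; k < runs.size(); k++) {
            result.remaining.add(runs.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("ab|1.1|Sam", "ple", "|1.1|cd");
        Result r = carve(runs, 0, 2, "|1.1|Sample|1.1|".length());
        System.out.println("matched:   " + r.matched);
        System.out.println("remaining: " + r.remaining);
    }
}
```

Joining the matched pieces gives back the full match text, which is why the callback can delete exactly those runs after writing the replacement.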