Find and Replace Text values in Document


#1

I currently let users create master documents in Word that embed specially coded strings representing data values. The codes are strings enclosed in square brackets (e.g. [CODE]). When a document is generated, I parse it and replace each bracketed code I find with the appropriate data value.

At present I am using Word Automation, but I have a requirement to let users print these documents from the web. I am evaluating Aspose.Word & PDF for this and I like what I see, but I am struggling to implement this with Aspose.Word.

I have gone through the forum and saw a response advising someone to look at an article titled "Working with Ranges". I have looked at it, and I could do a search and replace for each code I have, but as users only embed a fraction of the possible codes in any single document, it is really inefficient to call Range.Replace for every possible code. I need to be able to use a regular expression (e.g. \\[[^\\]]*\\]) to identify all strings enclosed in [ ], then evaluate each one as it is found, look up the data value for that code, and replace it.

Would really appreciate your assistance here.

Many Thanks

Jeff


#2

Hi Jeff,

Thanks for considering Aspose.Word.

To quickly find out which [] tags a given document contains, you can use the Document.GetText method. It returns a string containing all of the document text. You can then use the .NET Regex functions to find all [] tags, enumerate them, and build a substitution dictionary. Finally, you can iterate through the document paragraphs, making the necessary substitutions.
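A minimal sketch of the first two steps (finding the tags and building the dictionary), written against a plain string so it does not depend on any Aspose call. The text would come from Document.GetText; TagScanner, BuildSubstitutions, and the lookup delegate are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagScanner
{
    // Matches "[", then one or more non-bracket characters, then "]".
    private static readonly Regex TagPattern = new Regex(@"\[([^\[\]]+)\]");

    // Returns a map from each distinct code found in the text to its value.
    // Codes for which lookup returns null are left out of the map.
    public static Dictionary<string, string> BuildSubstitutions(
        string documentText, Func<string, string> lookup)
    {
        var substitutions = new Dictionary<string, string>();
        foreach (Match match in TagPattern.Matches(documentText))
        {
            string code = match.Groups[1].Value;
            if (substitutions.ContainsKey(code))
            {
                continue; // this code already has a value
            }

            string value = lookup(code);
            if (value != null)
            {
                substitutions[code] = value;
            }
        }
        return substitutions;
    }
}
```

Only the codes that actually occur in the document end up in the dictionary, so the later replacement pass does not have to try every possible code.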

But that's still a little cumbersome. Maybe you should consider using MailMerge. It is more apt for this task. The only difference in preparing master documents will be the insertion of mergefields instead of [] tags.


#3

Jeff,

I'm not sure which way is better, but this is what I'm currently doing to achieve the functionality you want (I do the same thing but look for fields enclosed in << and >>)...

Aspose.Word.NodeList runs = doc.SelectNodes("//Run");

// file variable below is the path to the file you wish to open and find/replace
Document doc = new Document(file);

foreach (Aspose.Word.Run run in runs)
{
    // Text of the current run
    string runText = run.Text;

    int fieldStart = runText.IndexOf("<<", 0);
    // Only look for the closing delimiter after the opening one
    int fieldEnd = fieldStart >= 0 ? runText.IndexOf(">>", fieldStart + 2) : -1;

    if (fieldStart >= 0 && fieldEnd >= 0)
    {
        // Note: the + 2 and - 2 are 2's because << and >> are 2 chars long.
        string field = runText.Substring(fieldStart + 2, fieldEnd - fieldStart - 2);

        // Code to get the value for field
        string value = fooGetFieldValue(field);

        run.Text = runText.Replace("<<" + field + ">>", value);
    }
}


#4
johnlsmith wrote:

I made a mistake when I copied and pasted my code: the first few lines should be switched, so that the document is loaded before its runs are selected:

// file variable below is the path to the file you wish to open and find/replace
Document doc = new Document(file);

Aspose.Word.NodeList runs = doc.SelectNodes("//Run");


#5

One more thing I should mention: it is important that <<field>> (in my case) is all in the same font and format. If, say, the inner text is bold and the enclosures aren't, then << and >> will be in different runs from field. In other words, <<field>> will be composed of three runs: <<, field, and >>. This will result in <<field>> not being found and therefore not replaced.
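To illustrate the point, here is a small sketch that stands in for the runs with plain strings (no Aspose types are involved; RunSplitDemo and its methods are made-up names). It only shows why a per-run search misses a tag whose parts are formatted differently, while a search over the joined text still finds it.

```csharp
using System;

public static class RunSplitDemo
{
    // True if any single run contains the whole tag.
    public static bool FoundInSingleRun(string[] runs, string tag)
    {
        foreach (string run in runs)
        {
            if (run.IndexOf(tag, StringComparison.Ordinal) >= 0)
            {
                return true;
            }
        }
        return false;
    }

    // True if the tag appears once the runs are joined back together.
    public static bool FoundInJoinedText(string[] runs, string tag)
    {
        return string.Concat(runs).IndexOf(tag, StringComparison.Ordinal) >= 0;
    }
}
```

With the runs "Dear ", "<<", "name", ">>" and the tag "<<name>>", FoundInSingleRun returns false and FoundInJoinedText returns true.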

John Smith


#6

Thanks Vladimir,

I agree that mail merge is the better option, and for a new version of our Windows Forms application we will certainly use it to generate master documents. At this time, however, I need some way to create documents that are compatible with the master documents already used by the existing Windows Forms application. I will try out the suggestion provided by John and see how that works.

Is this functionality something that you are looking to add to the product in future?

Regards

Jeff


#7

John,

Thanks for your example. I also ran into a problem with the use of the Run: codes containing numbers (e.g. [CODE1]) were split into 3 different runs, just as you were finding with different formatting. I also found that when there was more than one code within a run, only the first one was replaced.

I have reworked your example to overcome the problems I encountered, and hopefully this also deals with your formatting issue. I'm not sure it is the most efficient option, but it does seem to work.

It hasn't been extensively tested, so no guarantees! Hope this is of use to you.

Cheers

Jeff

// filename is the document to use as the master
Document doc = new Document(filename);

// Hashtable to hold the codes found and their replacement values
System.Collections.Hashtable replacementCodesAndValues = new System.Collections.Hashtable();

// Get the entire text of the document
string docText = doc.GetText();
string field;
bool foundCode = true;
int startingIndex = 0;
int fieldStart = 0;
int fieldEnd = 0;

// Keep looking for matching codes within the document text
while (foundCode)
{
    // Look for the next opening delimiter in the document text
    fieldStart = docText.IndexOf("[", startingIndex);
    fieldEnd = -1;

    // Found an opening delimiter, so look for the closing delimiter
    // that comes after it
    if (fieldStart >= 0)
    {
        fieldEnd = docText.IndexOf("]", fieldStart + 1);
    }

    // Did we find a start and end point and therefore find a code?
    if (fieldStart >= 0 && fieldEnd >= 0)
    {
        // Note: the + 1 and - 1 are 1's because [ and ] are each 1 char long.
        field = docText.Substring(fieldStart + 1, fieldEnd - fieldStart - 1);

        // Code to get the replacement value for the field code
        string replacementValue = GetValueForReplacementField(field);

        // Did we find a value to replace this code with?
        if (replacementValue != null)
        {
            // Record the code and its value. The indexer is used instead of
            // Add so that a repeated code cannot throw a duplicate-key exception.
            replacementCodesAndValues[field] = replacementValue;

            // Replace the code within the text so we don't "find" it again
            docText = docText.Replace("[" + field + "]", replacementValue);

            // Reset the start point to start looking for the next replacement
            // after the one we have just made
            startingIndex = fieldStart + replacementValue.Length;
        }
        else
        {
            // Didn't find a value for this code, so start searching for the
            // next code after this one
            startingIndex = fieldStart + field.Length + 2;
        }
    }
    else
    {
        foundCode = false;
    }
}

// Now we have the hashtable of all the codes we found and their replacement
// values. Use the Document.Range.Replace function to actually
// replace the text in the document
int replacementsMade = 0;
foreach (string code in replacementCodesAndValues.Keys)
{
    // Escape the code in case it contains regex metacharacters, and pass the
    // value stored for this code (not the hashtable itself)
    replacementsMade += doc.Range.Replace(
        new System.Text.RegularExpressions.Regex(
            "\\[" + System.Text.RegularExpressions.Regex.Escape(code) + "\\]"),
        (string)replacementCodesAndValues[code]);
}
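
For comparison, the scanning part could probably be collapsed into a single Regex.Replace call with a MatchEvaluator, which evaluates each code as it is found. This is only a sketch on a plain string and does not touch the Aspose document; CodeReplacer, ReplaceCodes, and the lookup delegate are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeReplacer
{
    // Matches "[", then one or more non-bracket characters, then "]".
    private static readonly Regex CodePattern = new Regex(@"\[([^\[\]]+)\]");

    // Replaces every [CODE] for which lookup returns a value.
    // Codes with no value (lookup returns null) are left unchanged.
    public static string ReplaceCodes(string text, Func<string, string> lookup)
    {
        return CodePattern.Replace(text, match =>
        {
            string value = lookup(match.Groups[1].Value);
            return value ?? match.Value;
        });
    }
}
```

Because the evaluator runs once per match, several codes in the same piece of text are all handled in one pass.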