Extracting Specific Text From Word Doc

Hi,

We want to use Aspose.Words to extract specific text from some Word documents. The documents are formatted as follows (see also the attached file):


First Name: Suzanne
Last Name: Test

Home Phone: 905-123-4567
Work Phone: 905-234-5678
Other Phone: 905-333-2222
Email: abcd@test.ca
Comments: Oct/07: Comment line1.
Aug 10/07: Comment line2.
Aug 12/03: Comment line3

Some more text


Is there any way to search for “First Name:” and retrieve the text “Suzanne”, search for “Comments:” and retrieve the three lines of text, and so on?

Thanks for your help.

Dave

Hi

Thank you for your interest in Aspose.Words. I think you can achieve this using regular expressions and ReplaceEvaluator. For example, the following code extracts the first name.

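using System.Text.RegularExpressions;
using Aspose.Words;

public void TestReplaceEvaluator_109307()
{
    // Open the document.
    Document doc = new Document(@"458_109307_queuesystems\in.doc");
    // Capture the rest of the "First Name:" paragraph in the named group "value".
    Regex regex = new Regex(@"First Name:(?<value>.*?)\r");
    // Walk the matches; the evaluator reads each one and leaves the document unchanged.
    doc.Range.Replace(regex, new ReplaceEvaluator(ReplaceAction_109307), true);
}

static ReplaceAction ReplaceAction_109307(object sender, ReplaceEvaluatorArgs e)
{
    // Read the first name from the named group, trimming the space after the colon.
    string firstName = e.Match.Groups["value"].Value.Trim();
    // Skip tells Aspose.Words not to replace the matched text.
    return ReplaceAction.Skip;
}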

You can use the following Regex to extract the comments.

Regex regex = new Regex(@"Comments:(?<value>.*?)\f");

Here “\r” is the paragraph break character and “\f” is the page break character.
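To see how the two terminators behave, here is a minimal sketch that runs the same patterns against a plain string with the standard .NET Regex class, without Aspose.Words. The class name FieldRegexSketch, the Extract helper, and the sample text are hypothetical; the sketch assumes the document text uses “\r” between paragraphs and “\f” at the page break, as described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldRegexSketch
{
    // Returns the trimmed text captured by the named group "value",
    // or null when the pattern does not match.
    public static string Extract(string text, string pattern)
    {
        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Groups["value"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Hypothetical sample: paragraphs end with "\r", and a page break "\f" follows the comments.
        string text = "First Name: Suzanne\rLast Name: Test\r"
                    + "Comments: Oct/07: Comment line1.\rAug 10/07: Comment line2.\f";

        // The lazy ".*?" stops at the first "\r", so only the first name is captured.
        string firstName = Extract(text, @"First Name:(?<value>.*?)\r");

        // "." matches "\r" by default, so this capture spans paragraphs up to the first "\f".
        string comments = Extract(text, @"Comments:(?<value>.*?)\f");

        Console.WriteLine(firstName);
        // Print the comment paragraphs on separate console lines.
        Console.WriteLine(comments.Replace("\r", Environment.NewLine));
    }
}
```

Because the quantifier is lazy, each pattern ends at the nearest terminator after the label. If the document has no page break after the comments, the second pattern finds no match and Extract returns null.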

I hope that this will help you.

Best regards

Thanks for the quick response!

Your code works great!

Hi Alexey,

Do you know how I can set up the regex to match from “Comments:” to the end of the document? Is there a special end-of-document character?

Regex regex = new Regex(@"Comments:(?<value>.*?)\END?");

Thanks

Dave

Hi

I think that you can try using the following Regex.

Regex regex = new Regex(@"Comments:(?<value>.*)");
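The reason no end-of-document character is needed is that a greedy “.*” keeps matching until it runs out of input. Here is a minimal sketch of that behaviour on a plain string, using the standard .NET Regex class rather than Aspose.Words. The class name GreedyTailSketch, the ExtractTail helper, and the sample text are hypothetical. RegexOptions.Singleline is my own addition, not part of the pattern above: by default “.” does not match “\n”, so without it the capture would stop at the first line feed, if the text contains one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyTailSketch
{
    // Captures everything after "Comments:" up to the end of the input,
    // or returns null when the label is not found.
    public static string ExtractTail(string text)
    {
        // Singleline lets "." match "\n" too, so the greedy ".*" runs to the end of the text.
        Match match = Regex.Match(text, @"Comments:(?<value>.*)", RegexOptions.Singleline);
        return match.Success ? match.Groups["value"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Hypothetical sample: the comments are the last paragraphs, separated by "\r".
        string text = "Email: abcd@test.ca\r"
                    + "Comments: Oct/07: Comment line1.\rAug 10/07: Comment line2.\rAug 12/03: Comment line3\r";

        string comments = ExtractTail(text);

        // Print the comment paragraphs on separate console lines.
        Console.WriteLine(comments.Replace("\r", Environment.NewLine));
    }
}
```

Note that the greedy capture also includes any text that follows the comments, such as the “Some more text” paragraph in the sample document, so it only fits when the comments are the last thing you want.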

Hope this helps.

Best regards.