Searching for text and working with results

therman · April 21, 2008, 5:42pm

I currently use Aspose.Words to read Word files and extract pieces of information from them. I cannot send you a copy of the documents I am dealing with, but they have contents like this:

Field1: AB1111-11
Field2: AB-11111;something something something
Field3: 01/01/1001
Field4:
Some data
Some More data

The idea is that I know what the field names are, but the structure of the document is not standard. I need to be able to search through the document looking for the field names. I know how to do this using Range.Replace and a ReplaceEvaluator. Where I need advice is how to get the line of text that follows the field name, which is what I really need. Most fields are organized like my sample fields 1-3. Occasionally I will run into fields like #4, where the text I want is on the lines following the field name.
If it helps narrow things down, the fields are all part of a single cell in a table.
I have been a developer for a long time so you should only need to point me in the correct direction before I get the idea. This is a very urgent matter and any advice you can offer would be greatly appreciated.
Thanks,
Todd

Klepus · April 22, 2008, 2:12am

Hello Todd!
Thank you for your inquiry.
It’s very nice to get questions from experienced developers, since you won’t ask me to write the whole program for you :)
I expect these fields are not regular MS Word form fields. They are just fragments of text prefixed with constant labels like “First name:”, “Surname:”, “Age:” etc. Since you are familiar with the find/replace functionality, it’s easy to find them in the document.
After you find a particular label, all the remaining paragraph text belongs to the corresponding value. If soft line breaks are used, then you should check for them too. A Paragraph object in MS Word and Aspose.Words typically consists of Run objects. A Run is a piece of text with its own set of attributes. Traverse the Runs in the Paragraph containing the label you found and collect the text from them. Note that in general a label can end in the middle of a Run. Another way is to get the text of the whole Paragraph (ToTxt method) and cut off the label.
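A minimal sketch of the second approach (take the whole paragraph text and cut off the label), assuming the paragraph text is already available as a plain string, for example from ToTxt. The names LabelValue and ExtractValue are hypothetical helpers, not part of Aspose.Words, and the sketch assumes a single label per paragraph:

```csharp
using System;

public static class LabelValue
{
    // Hypothetical helper: returns the text that follows `label` in
    // `paragraphText`, or null if the label does not occur at all.
    public static string ExtractValue(string paragraphText, string label)
    {
        int start = paragraphText.IndexOf(label, StringComparison.Ordinal);
        if (start < 0)
            return null;

        // Everything after the label belongs to the value.
        string rest = paragraphText.Substring(start + label.Length);

        // Strip surrounding whitespace and paragraph/line-break control characters.
        return rest.Trim(' ', '\t', '\r', '\n', '\v', '\f');
    }
}
```

Because this works on the concatenated text, it does not matter whether the label ends in the middle of a Run.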
Regards,

therman · April 22, 2008, 7:46am

Thank you for the response. You are correct in that they are not actual form fields. That would make my life a little too easy for them, I suppose. I looked at it more last night and was leaning towards looping through all the subsequent Run objects. I believe this is basically what you are suggesting.
The only additional question is what “label” you are referring to. The lines all end in line feeds, which I believe are represented by “\v”. I think I noticed you have a ControlChars enumeration which contains this, so are you suggesting I look for that?
When I find the text I am looking for, is there some way to return only specific nodes (Runs) from that point on? If I loop through the siblings, I may get more objects than I want (or cross paragraph boundaries).
Thanks,
Todd

Klepus · April 22, 2008, 8:52am

Hello!
I’m referring to text preceding values that you are extracting. For instance, if you have a line like this:
Name: John
then "Name: " is a label and “John” is a value. You can call them whatever you like. In any case, we mean the part of the line that indicates the purpose of the value but does not include the value itself.
If the lines are separate paragraphs, then you don’t need to look for any line feeds. The Paragraph class does everything for you. Soft line breaks are represented by this character:

/// <summary>
/// Line break character: (char)11 or "\v"
/// </summary>
public const char LineBreakChar = (char)11;

You can try searching for it in the runs if you know that line breaks could occur in the document. Note that if the document format is not standardized, this could happen along with other irregularities.
Have you tried using Paragraph.ToTxt()? Maybe you won’t need to traverse the runs at all. Line breaks inside the paragraph should show up in that text too.
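A rough sketch of how the paragraph text could be split on soft line breaks and matched against the known labels. It assumes the text is already a plain string (for example from Paragraph.ToTxt()) and that lines without a known label continue the previous field, as with Field4 in the sample. FieldParser and Parse are hypothetical names, not Aspose.Words API:

```csharp
using System;
using System.Collections.Generic;

public static class FieldParser
{
    // Soft line break, same value as the constant quoted above.
    private const char LineBreakChar = (char)11;

    // Hypothetical helper: maps each known label to the text that follows it.
    public static Dictionary<string, string> Parse(string paragraphText, string[] labels)
    {
        var result = new Dictionary<string, string>();
        string current = null; // label whose value is currently being collected

        string[] lines = paragraphText.Split(
            new[] { LineBreakChar, '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawLine in lines)
        {
            string line = rawLine.Trim();
            string matched = null;
            foreach (string label in labels)
            {
                if (line.StartsWith(label, StringComparison.Ordinal))
                {
                    matched = label;
                    break;
                }
            }

            if (matched != null)
            {
                // New field: the rest of the line is its value (possibly empty).
                current = matched;
                result[current] = line.Substring(matched.Length).Trim();
            }
            else if (current != null && line.Length > 0)
            {
                // No label on this line: treat it as a continuation of the previous field.
                result[current] = result[current].Length == 0
                    ? line
                    : result[current] + "\n" + line;
            }
        }
        return result;
    }
}
```

If the continuation lines turn out to be separate paragraphs rather than soft line breaks, the same loop could be fed the text of the following paragraphs in the cell instead.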
Regards,

therman · April 22, 2008, 9:11am

Ahhh. Sorry about that. I thought you were referring to something within Aspose.Words. You seem to fully understand what I am trying to do. I have a known field and I need the value.
I will test Paragraph.ToTxt() and see what that leaves me with. I may be able to split the lines on the linebreaks and work with it that way. Between that and traversing the Run objects, that gives me two methods. I will test them both out.
Thanks,
Todd