Parsing word files


#1

Hi

I need to examine every word in a Word document and format it depending on a certain condition. More specifically, I need to look through a document and, wherever I find the following:

<> some text in a word document <>

I need to bold the text between the <> tags and then remove the tags from the document completely. What would be the best way to approach this using Aspose.Word? I have looked at the API and can't figure out a way of achieving this.

Regards Michael


#2

The recently published introduction to the Aspose.Word Object Model in the wiki should give you a better understanding of the subject.

With the current API, here is what you need to do:

  1. Enumerate through Run nodes of the document.
  2. Examine Run.Text to see if it contains the text you are looking for, for example <>. Remove the tag by modifying the Run.Text property and set your state machine to "in tag". Set Run.Font.Bold = true or any other formatting you need.
  3. Continue examining further Run nodes to find the closing tag, and keep modifying Run.Font properties so that all text between the tags has the formatting you need. Remove the end tag by modifying Run.Text.
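
The steps above can be sketched as a small state machine. This is only an illustration in Python, not Aspose.Word code: the Run class here is a hypothetical stand-in that holds just a text and a bold flag, and bold_between_tags is a name made up for this example. It assumes every <> tag sits entirely inside one run, and it splits a run wherever a tag falls in the middle, since the text before and after the tag needs different formatting.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for a document run: text with a single format."""
    text: str
    bold: bool = False


TAG = "<>"  # marker from the question; the same marker opens and closes


def bold_between_tags(runs):
    """Return new runs with the tags removed and the text between them bold.

    Assumption: a tag is never split across two runs.
    """
    out = []
    in_tag = False  # state machine: are we between an opening and closing tag?
    for run in runs:
        pieces = run.text.split(TAG)
        for i, piece in enumerate(pieces):
            if i > 0:
                # Every boundary between pieces is one tag, so flip the state.
                in_tag = not in_tag
            if piece:
                # Keep existing bold; add bold while inside the tags.
                out.append(Run(piece, run.bold or in_tag))
    return out


if __name__ == "__main__":
    doc = [Run("Hello <>bold"), Run(" still bold<> plain")]
    for r in bold_between_tags(doc):
        print(repr(r.text), r.bold)
```

With a real document the same loop would read and write Run.Text and Run.Font.Bold instead of building new objects.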

This seemingly innocent requirement might not be very easy to cater for under some circumstances. The problem is that MS Word can break text into different runs at any position. For example, most of the time you would find <> whole somewhere in a string of Run.Text, but occasionally part of it could come in the next Run node. To handle such scenarios properly, your code would have to be a bit more complex than I described above. We plan to make Aspose.Word do this (join runs automatically where appropriate) in the future, so you don't have to handle such complicated cases.
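
One possible way to cope with a tag that straddles two runs is to stop looking at runs one at a time: flatten the text to characters, find the tags in the joined string, then regroup the characters into runs. The sketch below is again plain Python with a hypothetical Run class (text plus a bold flag), not Aspose.Word code, and the function name is invented for this example. It tracks bold only, so a real implementation would need to carry the rest of each run's formatting along as well.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Run:
    """Hypothetical stand-in for a document run: text with a single format."""
    text: str
    bold: bool = False


TAG = "<>"  # marker from the question; the same marker opens and closes


def bold_between_tags_joined(runs):
    """Like a per-run scan, but tolerant of tags split across run boundaries."""
    # One (character, bold) pair per character, so run boundaries disappear.
    chars = [(ch, run.bold) for run in runs for ch in run.text]
    text = "".join(ch for ch, _ in chars)

    styled = []
    in_tag = False
    i = 0
    while i < len(text):
        if text.startswith(TAG, i):
            in_tag = not in_tag  # crossing a tag flips the state
            i += len(TAG)        # skip the tag so it is dropped from the output
            continue
        styled.append((text[i], chars[i][1] or in_tag))
        i += 1

    # Regroup consecutive characters that share the same bold flag into runs.
    return [
        Run("".join(ch for ch, _ in group), bold)
        for bold, group in groupby(styled, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    doc = [Run("Hello <"), Run(">bold te"), Run("xt<"), Run("> plain")]
    for r in bold_between_tags_joined(doc):
        print(repr(r.text), r.bold)
```

A side effect of regrouping is that neighbouring runs with identical formatting are merged, which is acceptable in this simplified model but may not be wanted in a real document.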

Let us know if you need more help.