Extracting specific word texts revisited

Kai_Iske · February 14, 2005, 2:49am

Roman,

I just came back to Aspose.Word 2.1.9, as we are currently planning to purchase/upgrade and are thinking about using Aspose.Word in a project other than the one I initially started evaluating it with.

Here’s a recap of what I am attempting to do: extract text formatted with a specific style (by name) and get this text as plain ASCII/UTF text.

These are the problems I’m currently/still facing:

- I get FORMFIELD placeholders within the extracted text
- When changes are being tracked in the Word document, I will also “read” deleted text

I have sent you an example project including an example Document file (the document template). When you compile and run this sample, try the following:

- Set a breakpoint on line 47 of Class1.cs
- Run the app
- The app will launch Word editing the generated document
- Enable change tracking and delete the text in the translation box (at the bottom)
- Type some new text in the translation box
- Quit Word

When the breakpoint is hit, you will see the extracted text is somewhat “incorrect”.

I’m a bit lost at the moment as to how to tackle the problem. Any suggestions/hints would be highly appreciated.

Thanks very much in advance

Regards

Kai Iske

romank · February 16, 2005, 10:41am

Word documents contain a lot of special characters.

When there is a field in a document, such as a text box or any other field, you get the following sequence:
\x13 - start of field character
field code
\x14 - field separator (optional for some fields)
field value
\x15 - end of field

Since you are using IDocumentVisitor, it is by design that you get all of these. You need to implement a simple state machine to filter out unwanted characters and field codes if you don't want them in your output, maybe something like this:

switch (fieldChar)
{
    case WordChar.FieldBeginChar:
        gotSeparator = false;
        Lock();
        break;

    case WordChar.FieldSeparatorChar:
        gotSeparator = true;
        Unlock();
        break;

    case WordChar.FieldEndChar:
        // If the field has a separator, the output was already unlocked.
        if (!gotSeparator)
            Unlock();
        break;

    default:
        throw new Exception("Unknown field char.");
}
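
For reference, here is a minimal self-contained sketch of the same idea applied to a raw string that already contains the \x13, \x14 and \x15 characters. It is not Aspose.Word code: the class name FieldFilter and the method StripFieldCodes are made up for illustration, and it assumes that fields may be nested, so it tracks a stack instead of a single gotSeparator flag.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Aspose.Word.
public static class FieldFilter
{
    // Control characters that delimit a field in the text stream.
    private const char FieldStart = '\u0013';
    private const char FieldSeparator = '\u0014';
    private const char FieldEnd = '\u0015';

    // Removes field codes and field control characters, keeping field values.
    public static string StripFieldCodes(string raw)
    {
        var output = new StringBuilder();
        // One entry per open field: true while we are still in its field code.
        var inCode = new Stack<bool>();
        // Number of open fields that are currently in their field code part.
        int hidden = 0;

        foreach (char c in raw)
        {
            switch (c)
            {
                case FieldStart:
                    inCode.Push(true);
                    hidden++;
                    break;

                case FieldSeparator:
                    // Switch the innermost field from "code" to "value".
                    if (inCode.Count > 0 && inCode.Peek())
                    {
                        inCode.Pop();
                        inCode.Push(false);
                        hidden--;
                    }
                    break;

                case FieldEnd:
                    // A field without a separator is still in its code part here.
                    if (inCode.Count > 0 && inCode.Pop())
                        hidden--;
                    break;

                default:
                    // Emit text only when no enclosing field is in its code part.
                    if (hidden == 0)
                        output.Append(c);
                    break;
            }
        }
        return output.ToString();
    }
}
```

With this sketch, FieldFilter.StripFieldCodes("Intro \u0013 FORMTEXT \u0014Original Text\u0015 tail") returns "Intro Original Text tail". Note that the test string uses \u0013 rather than \x13, because C# reads "\x13F" as a single character when a hex digit follows.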

Regarding deleted or changed text, you are right, it should not make its way to the output and this will be fixed.

Kai_Iske · February 16, 2005, 10:37pm

Roman,

thanks for the update. I appreciate your support as I continuously nag about the issue :)

However, when you run the sample I sent you, you will notice the following text being extracted (apart from the tracked changes et al.):

Evaluation Only. Created with Aspose.Word. Copyright 2003-2004 Aspose Pty Ltd. Note: Random text in the document is also part of the evaluation watermark. FORMULARTEXT Original Text This is a testfgsdfsdf

Of course, the text in blue is the one coming from Aspose.Word as I’m still evaluating.
The text in black is the one I’m expecting. Notice the text in red though. The text “Original Text This is a test” gets set on a form field. Then I entered the text fgsdfsdf after the form field. When I extract, I do handle states (Word special characters). However, I always get the plain text FORMULARTEXT (which translates to FORMTEXT). Of course I could filter out this piece of text, but what if my application receives a Word document created using a French, Spanish, Polish or other version of Word? Then I would need to add each localized text to my exception list.

Any hint on this would be appreciated.

Regards

Kai

romank · February 20, 2005, 1:46pm

IDocumentVisitor will have special methods that will be called for the start, separator and end of field characters. When you get the start of a field, it will carry a FieldType enumeration value, which is language neutral.

Kai_Iske · February 20, 2005, 11:23pm

Roman,

thanks for the update. This sounds good to me. Looking forward to the update :)

Regards

Kai

romank · February 21, 2005, 2:05pm

See the new DocumentVisitor.VisitFieldStart, VisitFieldSeparator and VisitFieldEnd methods in Aspose.Word 2.1.12.
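
The exact signatures of these methods are not shown in this thread, so the following is only a rough sketch of how a text collector could react to start, separator and end notifications. Everything in it is a stand-in: the FieldKind enum, the PlainTextCollector class and its method names are hypothetical and are not the Aspose.Word API. It illustrates branching on a language-neutral field type instead of matching localized field code text such as FORMULARTEXT; keeping only the values of text form fields is an arbitrary choice made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a language-neutral field type enumeration.
public enum FieldKind { Unknown, FormTextInput, Reference }

// Hypothetical collector driven by field start/separator/end notifications.
public sealed class PlainTextCollector
{
    private readonly StringBuilder text = new StringBuilder();
    // One frame per open field: its kind and whether it currently hides text.
    private readonly Stack<(FieldKind kind, bool suppressing)> frames =
        new Stack<(FieldKind kind, bool suppressing)>();
    // Number of open fields that currently hide text.
    private int suppressed;

    public void FieldStart(FieldKind kind)
    {
        // Everything up to the separator is field code, so hide it.
        frames.Push((kind, true));
        suppressed++;
    }

    public void FieldSeparator()
    {
        if (frames.Count == 0) return;
        var (kind, suppressing) = frames.Pop();
        // Illustrative policy: only text form fields expose their value.
        bool keepResult = kind == FieldKind.FormTextInput;
        if (suppressing && keepResult)
        {
            suppressing = false;
            suppressed--;
        }
        frames.Push((kind, suppressing));
    }

    public void FieldEnd()
    {
        if (frames.Count == 0) return;
        if (frames.Pop().suppressing) suppressed--;
    }

    public void Text(string run)
    {
        // Emit a run only when no open field is hiding text.
        if (suppressed == 0) text.Append(run);
    }

    public override string ToString() => text.ToString();
}
```

Feeding it a start notification with FieldKind.FormTextInput, the run " FORMULARTEXT ", a separator, the run "Original Text" and an end notification leaves only "Original Text" in the collected output, whatever language the field code was written in.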

Kai_Iske · February 21, 2005, 10:42pm

Roman,

this is perfect. Using the new FieldStart, FieldSeparator and FieldEnd state indicators, I am now able to extract the text fine, omitting field codes.

I won’t ask for a prediction of when to expect the “Track Changes” handling :)

Thank you very much for the update

Regards

Kai