I just came back to Aspose.Word 2.1.9, as we are currently planning on purchasing/upgrading and are thinking about using Aspose.Word in a project other than the one I initially started evaluating it with.
Here’s a recap of what I am attempting to do: extract text formatted with a specific style (by name) and get this text as plain ASCII/UTF text.
These are the problems I’m currently/still facing:
- I get FORMFIELD placeholders within the extracted text
- When changes are being tracked in the Word document, I will also “read” deleted text
I have sent you an example project including an example Document file (the document template). When you compile and run this sample, try the following:
- Set a breakpoint on line 47 of Class1.cs
- Run the app
- The app will launch Word editing the generated document
- Enable change tracking and delete the text in the translation box (at the bottom)
- Type some new text in the translation box
- Quit Word
When the breakpoint is hit, you will see the extracted text is somewhat “incorrect”.
I’m a bit lost at the moment as to how to tackle the problem. Any suggestions/hints would be highly appreciated.
Thanks very much in advance
Word documents contain a lot of special characters.
When there is a field in a document, such as a text box or any other field, you can get the following characters:
\x13 - start of field character
\x14 - field separator (optional for some fields)
\x15 - end of field
Since you are using IDocumentVisitor, it is by design that you get all of these. You need to implement a simple state machine to filter out unwanted characters and field codes if you don't want them in your output, maybe something like this:
gotSeparator = false;
gotSeparator = true;
// If the field has a separator, the output was already unlocked.
throw new Exception("Unknown field char.");
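The state machine described above can be sketched roughly as follows. This is a minimal illustration in Python that works on a plain string containing the \x13, \x14 and \x15 characters; it does not use the Aspose.Word API, and the function name strip_field_codes is made up for this example. It assumes the field code sits between the start and the separator and the field result between the separator and the end, and it allows for nested fields by keeping one flag per open field.

```python
FIELD_START = "\x13"      # start of field character
FIELD_SEPARATOR = "\x14"  # field separator (optional for some fields)
FIELD_END = "\x15"        # end of field


def strip_field_codes(text):
    """Return text without field characters and field codes (hypothetical helper)."""
    out = []
    # One flag per open field: True once that field's separator has been seen.
    open_fields = []
    for ch in text:
        if ch == FIELD_START:
            open_fields.append(False)   # lock output until the separator
        elif ch == FIELD_SEPARATOR:
            if not open_fields:
                raise ValueError("field separator outside of a field")
            open_fields[-1] = True      # unlock output for the field result
        elif ch == FIELD_END:
            if not open_fields:
                raise ValueError("field end without a field start")
            open_fields.pop()
        elif all(open_fields):
            # Outside any field, or inside the result part of every open field.
            out.append(ch)
    return "".join(out)


sample = "\x13 FORMTEXT \x14Original Text\x15 tail"
print(strip_field_codes(sample))
```

Text outside any field passes through unchanged, and a field without a separator contributes nothing to the output.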
Regarding deleted or changed text, you are right, it should not make its way to the output and this will be fixed.
Thanks for the update. I appreciate your support as I continuously nag about the issue.
However, when you run the sample I sent you, you will notice the following text being extracted (apart from the tracked changes etc.):
Evaluation Only. Created with Aspose.Word. Copyright 2003-2004 Aspose Pty Ltd. Note: Random text in the document is also part of the evaluation watermark. FORMULARTEXT Original Text This is a testfgsdfsdf
Of course, the text in blue is the one coming from Aspose.Word as I’m still evaluating.
The text in black is the one I’m expecting. Notice the text in red, though. The text “Original Text This is a test” is set on a form field, and I entered the text fgsdfsdf after the form field. When I extract, I do handle the states (Word special characters). However, I always get the plain text FORMULARTEXT (which translates to FORMTEXT). Of course I could filter out this piece of text, but what if my application receives a Word document created with a French, Spanish, Polish or other language version of Word? Then I would need to add each localized text to my exception list.
Any hint on this would be appreciated.
IDocumentVisitor will have special methods that will be called for the start, separator and end of field characters. When you get the start of a field, it will carry a FieldType enumeration value that is language neutral.
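To illustrate why dedicated callbacks remove the need for a list of localized field names, here is a rough sketch in Python. Everything in it is hypothetical: the FieldType members, the TextExtractor class and its method names are stand-ins invented for this example and are not the actual Aspose.Word signatures. The extractor never looks at the field code text, so it does not matter whether the code reads FORMTEXT or FORMULARTEXT.

```python
from enum import Enum


class FieldType(Enum):
    """Hypothetical, language-neutral field types (not the Aspose.Word enum)."""
    FORM_TEXT_INPUT = 1
    PAGE = 2


class TextExtractor:
    """Collects run text, skipping everything that belongs to a field code."""

    def __init__(self):
        self.parts = []
        self.seen_types = []
        # One flag per open field: True once its separator has been reported.
        self.open_fields = []

    def field_start(self, field_type):
        self.seen_types.append(field_type)
        self.open_fields.append(False)

    def field_separator(self):
        self.open_fields[-1] = True

    def field_end(self):
        self.open_fields.pop()

    def run_text(self, text):
        # Keep text outside fields and text in the result part of open fields.
        if all(self.open_fields):
            self.parts.append(text)

    def text(self):
        return "".join(self.parts)


extractor = TextExtractor()
extractor.field_start(FieldType.FORM_TEXT_INPUT)
extractor.run_text(" FORMULARTEXT ")          # localized field code, skipped
extractor.field_separator()
extractor.run_text("Original Text This is a test")
extractor.field_end()
extractor.run_text("fgsdfsdf")
print(extractor.text())
```

The recorded field types could also be used to decide per field whether its result should be kept at all, for example keeping form field results but dropping page numbers.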
Thanks for the update. This sounds good to me. Looking forward to the update.
See the new IDocumentVisitor.FieldStart, FieldSeparator and FieldEnd methods in Aspose.Word 2.1.12: http://aspose.com/blogs/Roman.Korchagin/archive/2005/02/22/528.aspx
This is perfect. Using the new FieldStart, FieldSeparator and FieldEnd state indicators, I am now able to extract the text fine, omitting field codes.
I won’t ask for a prediction of when to expect the “Track Changes” handling.
Thank you very much for the update