Hints for using Aspose.Words from BizTalk

hsiegel · May 14, 2008, 3:07pm

Anyone have any hints, tips, and/or gotchas about using Aspose.Words from BizTalk, either from a receive pipeline component or inside an orchestration to process incoming .doc files?

Thanks.

- h

romank · May 14, 2008, 10:57pm

Aspose: Well, Aspose.Words is just a .NET class library. I'm only vaguely familiar with BizTalk, but if you can use a .NET class library then you can use Aspose.Words. What exactly do you want to do? We have shown in the past how to successfully invoke Aspose.Words from SharePoint to do document conversions, for example. I can imagine something can be done for BizTalk too.

Customer: We are getting Word documents with tables of data (invoicing, inventory, etc.) and have to extract the data from the tables and insert it into another system (either by direct insert/modify/etc. to a database or by using a web service). We would use Aspose.Words in the receiving BizTalk pipeline to extract the data and emit the relevant data as XML conforming to our own schemas.

Aspose: It should not be a problem, but you will need to write code. Aspose.Words is a library you can use to open a document and get access to all its content elements and formatting programmatically. You can then go through the content, such as tables and rows, and insert it into a database, emit your own XML, or convert to OOXML or any other format.

Customer: Obviously, yes, I would need to write code. At first we were going to have the documents saved out in XML (WordML) and process that ourselves. But that would just be writing a library much like Aspose.Words. Why reinvent the wheel? And asking for the final document to be saved in XML is probably not going to fly.

Aspose: Yes, extracting data from WordML could be hard, especially if you get things like the word "Hello" split across several runs. And you do get them in Word documents.

Customer: Yes. I've saved out a sample document in XML and it is ugly. I would have a decent amount of work to parse out what we need. Though, luckily, there is very little style and formatting information in our files, at least not in the data in the table cells, so having to find/merge runs would not be that much of an issue.

If there are any specific resources that might be useful, can you please put pointers to them in the thread in the forum?

Aspose: You should probably look at implementing a DocumentVisitor. Here is one example of building a doc2txt converter: http://www.aspose.com/documentation/file-format-components/aspose.words-for-.net-and-java/aspose.words.documentvisitor.html You derive from DocumentVisitor and override VisitTableStart, VisitRowStart and so on. To better understand the object model, run the DocumentExplorer demo and open your document in it. You will see the document as a tree, exactly as it would be in memory in an Aspose.Words Document. The demo is probably implemented using a visitor too.

Customer: Using the explorer is the first thing I was going to do tomorrow. Thanks.

hsiegel · May 15, 2008, 12:08am

Thanks Roman.

There was one question I didn’t ask before I had to pack up… The documents we’ll be getting have “track changes” turned on and we might need to get at the original text. How difficult is it to detect if text has been modified and to get back the original text? What about detecting a deleted or added table row and getting back the deleted row?

- h

romank · May 15, 2008, 12:52am

Tracked changes are generally supported in Aspose.Words.

I think the current model in Aspose.Words is Before + Changes. You can call Document.AcceptAllRevisions so the model becomes After. Also see http://www.aspose.com/documentation/file-format-components/aspose.words-for-.net-and-java/manage-tracking-changes.html

So if you walk the document tree before calling Document.AcceptAllRevisions you will see all deleted elements in the document. If you walk the document tree after AcceptAllRevisions you will not see deleted things.
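The "Before + Changes" idea can be illustrated with a small self-contained model. This is only a sketch in Python, not Aspose.Words code: the `Run` class, its two flags, and the helper functions are hypothetical stand-ins invented for illustration, and the assumption is that each piece of text carries a marker saying whether it is a pending insertion or a pending deletion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """Hypothetical stand-in for a piece of text in the document tree."""
    text: str
    is_insert_revision: bool = False  # inserted, not yet accepted
    is_delete_revision: bool = False  # deleted, not yet accepted


def text_before(runs: List[Run]) -> str:
    """Text as it was before the tracked changes: skip pending insertions."""
    return "".join(r.text for r in runs if not r.is_insert_revision)


def text_after(runs: List[Run]) -> str:
    """Text as it will be once changes are accepted: skip pending deletions."""
    return "".join(r.text for r in runs if not r.is_delete_revision)


def accept_all(runs: List[Run]) -> List[Run]:
    """Model of accepting all revisions: deletions vanish, insertions become plain text."""
    return [Run(r.text) for r in runs if not r.is_delete_revision]


runs = [
    Run("Qty: "),
    Run("10", is_delete_revision=True),
    Run("12", is_insert_revision=True),
]
print(text_before(runs))  # Qty: 10
print(text_after(runs))   # Qty: 12
```

In this model a single walk over the unaccepted tree is enough to recover both versions, and after `accept_all` the deleted text is no longer reachable.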

hsiegel · May 15, 2008, 1:18am

Unless I am misunderstanding your response, what you’ve said is how it works when Aspose.Words is being used to edit a document. Is that correct?

Assuming my reading of your comments is correct, then my question stands…

In my case the document was created by another system, then hand edited in Word with track changes turned on, after which I will get the document to process. My processing consists of only reading the document and extracting the table data. I will not be editing the document and I will not be saving the document.

In a read-only mode on an existing document which has already been edited in Word with changes tracked, can I detect where there have been changes and get the original text as well as the current text, as well as detect when a row has been deleted (and get the data from that now deleted row)?

To wit, if I save the document out of Word as XML (WordML), I can see where text has changed because it is wrapped in an aml:annotation tag with a w:type attribute of “Word.Deletion”, and the original text is wrapped in a w:delText tag inside the annotation. If I was processing the WordML, I could detect those tags and recover the original text as well as the current text. Can that type of operation be done in Aspose.Words?
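For reference, the WordML approach described above might look roughly like the following. It is a minimal sketch using only the Python standard library, not a complete WordML reader. The namespace URIs, the sample fragment, the aml:content wrapper, and the "Word.Insertion" type are assumptions based on the Word 2003 XML format; only aml:annotation, w:type="Word.Deletion" and w:delText come from the description above. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for Word 2003 XML (WordML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
AML = "http://schemas.microsoft.com/aml/2001/core"

# Hypothetical sample paragraph: "10" was deleted and "12" inserted.
SAMPLE = """<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
     xmlns:aml="http://schemas.microsoft.com/aml/2001/core">
  <w:r><w:t>Quantity: </w:t></w:r>
  <aml:annotation aml:id="0" w:type="Word.Deletion">
    <aml:content><w:r><w:delText>10</w:delText></w:r></aml:content>
  </aml:annotation>
  <aml:annotation aml:id="1" w:type="Word.Insertion">
    <aml:content><w:r><w:t>12</w:t></w:r></aml:content>
  </aml:annotation>
</w:p>"""


def original_and_current(paragraph):
    """Return (original_text, current_text) for one w:p element."""
    original, current = [], []

    def walk(node, state):
        if node.tag == "{%s}annotation" % AML:
            kind = node.get("{%s}type" % W)
            if kind == "Word.Deletion":
                state = "deleted"
            elif kind == "Word.Insertion":
                state = "inserted"
        elif node.tag in ("{%s}t" % W, "{%s}delText" % W):
            text = node.text or ""
            if state != "inserted":   # inserted text was not in the original
                original.append(text)
            if state != "deleted":    # deleted text is not in the current version
                current.append(text)
        for child in node:
            walk(child, state)

    walk(paragraph, "normal")
    return "".join(original), "".join(current)


print(original_and_current(ET.fromstring(SAMPLE)))
```

Even for this tiny fragment the code has to track annotation state while descending through runs, which gives a sense of how much hand-written parsing a full document with tables would need.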

I have not yet tackled how I might handle the delete row or insert row scenarios as I stopped short of my original plan of processing the WordML in order to investigate whether Aspose.Words can give me all the functionality I need so I don’t have to write my own library.

- h

romank · May 15, 2008, 1:32am

Read the article I've given in my previous post. You will probably get an idea of what Aspose.Words can do.

Do you actually need to detect individual deletions and insertions or not?

I assumed you do not need individual changes and you only need a document as it was before the changes were made or after they've been accepted. Both are possible with Aspose.Words. When the document is loaded it is "before" and if you call AcceptAllRevisions it becomes "after". It accepts changes made in MS Word while change tracking was on. It has nothing to do with editing the document with Aspose.Words (those edits are not tracked, by the way).

But if you really need to detect individual deletions and insertions you still can do that. You need to look at the IsInsertRevision and IsDeleteRevision properties of Run and Paragraph nodes. If IsInsertRevision is true, it is inserted text that has not yet been accepted. If IsDeleteRevision is true, it is deleted text that has not yet been accepted. To detect if a table row has been deleted you need to see if all paragraphs in all cells of the row have IsDeleteRevision = true. To detect if a complete table was deleted you need to check the above condition in all rows of the table.
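The row and table checks described above can be sketched as follows. This is a Python model with hypothetical stand-in classes (`Paragraph`, `Cell`, `Row`, `Table`), not the Aspose.Words object model. It only illustrates the logic "a row is deleted when every paragraph in every cell is a delete revision". Treating an empty row or table as "not deleted" is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for the document tree; NOT the Aspose.Words classes.
@dataclass
class Paragraph:
    text: str
    is_delete_revision: bool = False


@dataclass
class Cell:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Table:
    rows: List[Row] = field(default_factory=list)


def row_is_deleted(row: Row) -> bool:
    """True when every paragraph in every cell of the row is a delete revision."""
    paragraphs = [p for cell in row.cells for p in cell.paragraphs]
    # Assumption: a row with no paragraphs at all is not reported as deleted.
    return bool(paragraphs) and all(p.is_delete_revision for p in paragraphs)


def table_is_deleted(table: Table) -> bool:
    """True when the row condition holds for all rows of the table."""
    return bool(table.rows) and all(row_is_deleted(r) for r in table.rows)


kept = Row([Cell([Paragraph("Widget")]), Cell([Paragraph("4")])])
gone = Row([Cell([Paragraph("Gadget", True)]), Cell([Paragraph("7", True)])])
table = Table([kept, gone])

deleted_rows = [r for r in table.rows if row_is_deleted(r)]
print(len(deleted_rows), table_is_deleted(table))  # 1 False
```

Because the deleted row's paragraphs are still present in the unaccepted tree, its cell text ("Gadget", "7" in this example) can be read in the same pass that detects the deletion.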

hsiegel · May 15, 2008, 1:52am

Thanks. I did misunderstand. My bad!

So if I’m reading the on-line documentation correctly, for any field which has been changed (edited, deleted, added), would I need to process the document node tree twice? That is, once without accepting changes to get the original text, then again with changes accepted, and then basically compare the before/after results?

I am actually still trying to get clarification on this issue, so I don’t know for sure yet whether I need to replace all the existing data with new data from the document if there have been changes, or I’ll need to only replace the data fields that changed and if so would I need both the original data and the new data or not.

Right now I’m just trying to figure out what I can do with the product to know if it will be able to do everything I need or not, once the users come back with their requirements.

romank · May 15, 2008, 3:38am

Here is basically what you can do:

Detect if an element is a delete revision or insert revision or normal text and act accordingly.
Accept all revisions in a document.

If you let me know what you want to do about revisions in a document (not in technical Aspose.Words terms, but in a business sense), I can tell you how.

hsiegel · May 15, 2008, 12:17pm

Thanks for all the help, but it now looks like our requirements are changing so that we’ll only have to deal with pure XML data files and the work of converting to/from MSWord will be handled a different way by someone else. If so, we’ll no longer be needing Aspose.Words, at least for this project.

I’m still interested in the technology, so I’ll continue to play with it, but at a low level.