Read Word doc line by line


#1

I am interested in this product, but before I buy it I would like to know whether there is a way to parse information out of a Word document using this tool. I have previously used Office automation through the Word object in .NET, but running Word as a server app isn’t stable. So I’d like to know whether the Aspose.Word DLL can do this. I need to read a document line by line.

Thanks for your time.


#2

Interesting question. I’m pretty sure we can do something about it once we understand better what information you need from the document.

1. We eventually plan to implement an object model similar to Word automation, but that’s still in the fairly distant future, so we need to look at something else for now.

2. Another solution we could explore is to provide callback events from Aspose.Word while it reads a file. For example, it could call back when it encounters a paragraph, section, header, table, picture and so on. How does that sound?

3. What information do you need to get? If just the text is enough, then you could use a different approach: let Aspose.Word save in text format and read your lines from the text file.

4. Do you need more details about the structure, such as sections, tables, headers, formatting, fields, styles? If so, would it help if we supported saving to, say, HTML or some XML format such as WordprocessingML?

So, to summarize, I see three options:

An API to access the object model - we cannot easily provide something similar to the MS Word API

An event-based mechanism to notify the client about the various parts of the document

Saving to a file in some format that the client can read to extract the necessary info

What do you think?


#3

Thanks for the reply. The data I need to get is very specific to the document. Say I need an address. It is formatted like this:

Name: Sara
Address: asdjhajsdhasd
City: asdasd
State: asdjasdh
phone: etc.


followed by lots of other info. The top lines are identifying characteristics of the document.

So, can Aspose.Word save it as a text file and remove all of the gibberish MS puts in?

When I use just .NET to save it as a text file and read the lines, there are a lot of strange characters in there.

Hope that makes more sense.


#4

Yes, you can save it as a plain text file.
You don’t even have to save to a file; you can save it into a stream.
Just use the proper overload of the Document.Save method and specify SaveFormat.FormatText as a parameter.
As long as you don’t need any formatting or document structure this is probably the simplest approach to take.

Aspose.Word uses the CrLf (carriage return + line feed) combination to delimit paragraphs and removes most of the special characters.
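As a rough illustration of the client side, here is a minimal Python sketch of reading the labelled lines out of the saved text. It assumes the text has already been exported with paragraphs delimited by CR LF, and it makes no Aspose.Word calls. The function name `read_fields`, the `wanted` labels and the sample string are hypothetical, modelled on the example in post #3.

```python
# Hypothetical sample of exported text: one paragraph per CR LF-delimited line.
SAMPLE = (
    "Name: Sara\r\n"
    "Address: asdjhajsdhasd\r\n"
    "City: asdasd\r\n"
    "State: asdjasdh\r\n"
    "followed by lots of other info\r\n"
)


def read_fields(text, wanted=("name", "address", "city", "state", "phone")):
    """Return a dict of the 'Label: value' lines whose label is in `wanted`."""
    fields = {}
    for line in text.split("\r\n"):
        # partition() splits on the first colon only, so values may contain colons.
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        # Skip lines without a colon, unknown labels, and repeated labels.
        if sep and key in wanted and key not in fields:
            fields[key] = value.strip()
    return fields


if __name__ == "__main__":
    print(read_fields(SAMPLE))
```

Only the first occurrence of each label is kept, on the assumption that the identifying lines sit at the top of the document.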

It lets some of the special characters through, such as section, page, line and other breaks. Just let me know if you particularly dislike some of them and we will filter them out as well.
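In the meantime, a caller who does not want these characters can drop them after the export. The Python sketch below assumes that page and section breaks arrive as a form feed (`\x0c`) and manual line breaks as a vertical tab (`\x0b`), which is how Word itself represents them. That is an assumption about the exported text, not something confirmed in this thread, so check the real output first. `strip_breaks` and `BREAK_CHARS` are hypothetical names.

```python
# ASSUMPTION: form feed = page/section break, vertical tab = manual line break.
# Inspect the real exported text before relying on these code points.
BREAK_CHARS = "\x0c\x0b"


def strip_breaks(text, chars=BREAK_CHARS):
    """Remove every character in `chars` from `text`; CR LF is left untouched."""
    return text.translate({ord(c): None for c in chars})


if __name__ == "__main__":
    exported = "Name: Sara\x0b\r\nCity: asdasd\x0c"
    print(repr(strip_breaks(exported)))
```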


#5

Hi Roman,
Just a quick note from a concerned developer - the rest of us might be relying on these characters that you are offering to filter out.

This is known as a breaking change. If you intend to make this sort of change, please extend the function with an option to filter these characters instead, so that the rest of us can keep the current functionality.

Of course that’s what you may have meant in your reply…in which case I apologise.

Excellent work BTW…

Best regards
Steve


#6

No change for now, but thanks for the reminder. We will add options rather than break the behaviour.
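To illustrate the "options rather than breaking changes" idea, here is a small Python sketch of an opt-in filter whose default keeps the existing output. `export_text` and its `strip_chars` parameter are hypothetical stand-ins, not part of the Aspose.Word API.

```python
def export_text(paragraphs, strip_chars=""):
    """Join paragraphs with CR LF; remove `strip_chars` only when the caller asks."""
    text = "\r\n".join(paragraphs)
    if strip_chars:
        # Opt-in path: existing callers never reach this branch.
        text = text.translate({ord(c): None for c in strip_chars})
    return text


if __name__ == "__main__":
    paragraphs = ["Name: Sara\x0c", "City: asdasd"]
    print(repr(export_text(paragraphs)))                      # break kept
    print(repr(export_text(paragraphs, strip_chars="\x0c")))  # break removed
```

Because the new parameter defaults to an empty string, code written against the old behaviour gets identical output, and only callers who pass `strip_chars` see the filtered text.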