I have just installed 3.0.3 from 220.127.116.11. I discovered another difference in addition to http://www.aspose.com/Forums/ShowPost.aspx?PostID=26028.
It seems that the URL of a hyperlink in Word is extracted too.
For example, a hyperlink in Word is extracted as plain text like this:
E-MAIL : HYPERLINK "mailto:email@example.com" firstname.lastname@example.org
Some strange characters are added. I believe they are the original path of the hyperlink.
This could be a good feature, but we process the text returned from Range.Text, and these additional characters break parts of our system.
Is there an option to select which part to return for a hyperlink, e.g. the original URL or the displayed text?
“Strange characters” are control characters that are included in Range.Text. In your case these seem to be FieldStartChar, FieldSeparatorChar and FieldEndChar. Furthermore, Range.Text currently includes not only field results but also field codes.
At the moment there is no way to select what should be included in Range.Text, but you could implement your own code to extract the hyperlink name. Here is an example of how to implement a custom class to extract the hyperlink name and target: http://www.aspose.com/Wiki/default.aspx/Aspose.Word/ReplacingHyperlinksExample.html
A Word document contains a number of special chars and also field codes like you've experienced. If your goal is just to get simple text of the document, the best approach would be to use Document.Save(fileName, SaveFormat.FormatText);
This will save into a file or stream. It will automatically use Cr + Lf for end of paragraphs, it will exclude field codes and strip many other special characters from text.
I think this will be the same as the output you used to have from Range.Text. As I said earlier in your other question, Range.Text used to do the same as saving in text format, but now it has changed. Save in text format still works the original way, but Range.Text is sort of more advanced and used for different purposes.
Thanks for your suggestion. I tried Document.Save instead of Range.Text and it returns the same text string that I got from Range.Text in 2.3.0. The strange characters in this post are gone, and it returns CrLf again, as discussed in my previous post. I nearly had to redo the data entry for all 1400 resumes, because the change from CrLf to Cr (2 chars to 1 char) shifted the indexes in all documents.
“Save in text format still works the original way, but Range.Text is sort of more advanced and used for different purposes.”
Which more advanced features of Range.Text over Document.Save do you mean?
However, the text returned by Document.Save in text format does not have any formatting at all, not even a control char that lets me differentiate between a line break, a table row and a table cell. Now all documents with tables have a line break after each td, which is not what we want.
Using Range.Text, we get tr as 2 dot chars and td as a single dot char. We use this info to lay out tables and line breaks properly.
Could you please advise what the differences between Range.Text and Document.Save are? If I were to use Range.Text, what special characters will be inserted besides the ones I found for hyperlinks?
My system needs to use the relative index of each character in 1400 Word documents, and we need humans to manually perform data entry to “train” the system. Obviously, we can't keep changing it.
Range.Text returns text with all of the control characters found in a Word file. See the ControlChar class that defines constants for all possible control characters. Most notable are: ParagraphBreak, Cell, SectionBreak, FieldStart, FieldSeparator, FieldEnd, Picture, DrawnObject, FootnoteRef. These characters are not going to change in Aspose.Word.
Also, as you have seen Range.Text includes both field codes and field results because that’s how text is stored in a Word file. So text of a whole field looks something like [FieldStart][field code][FieldSeparator][field result][FieldEnd], but note that some fields do not have FieldSeparator and do not have field result. Also, fields could be nested inside each other.
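The field layout described above can be handled with a small stack-based filter. This is only a sketch in plain Python that works on a string already obtained from Range.Text; it does not call Aspose.Word. It assumes the three field control characters have the codes 0x13, 0x14 and 0x15 (the DC3, DC4 and NAK characters reported later in this thread), and the function name is made up for illustration.

```python
FIELD_START = "\x13"      # DC3, assumed value of the FieldStart control char
FIELD_SEPARATOR = "\x14"  # DC4, assumed value of the FieldSeparator control char
FIELD_END = "\x15"        # NAK, assumed value of the FieldEnd control char


def strip_field_codes(text):
    """Keep field results; drop field codes and the field control chars."""
    out = []
    # One entry per open field: True while we are still inside its field code.
    in_code = []
    for ch in text:
        if ch == FIELD_START:
            in_code.append(True)
        elif ch == FIELD_SEPARATOR:
            if in_code:
                in_code[-1] = False  # the field result starts here
        elif ch == FIELD_END:
            if in_code:
                in_code.pop()
        elif not any(in_code):
            # Ordinary text, or a field result not nested in any field code.
            out.append(ch)
    return "".join(out)


sample = 'E-MAIL : \x13 HYPERLINK "mailto:a@b.c" \x14a@b.c\x15'
print(strip_field_codes(sample))
```

Fields without a separator contribute nothing to the output, and a field nested inside another field's code is dropped together with that code.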
If you use Document.Range.Text it will return text of the main text as well as text of the headers and footers in the document. You can use the Range property on any document node to return text of just that node, for example Document.Section.Body.Range.Text will return only the main text of the first section in the document.
You need to remember that the document is a tree of nodes (classes derived from the Node and CompositeNode classes) and you can iterate over the tree either in your own loops or by implementing a DocumentVisitor and extract the information the way you want it. This object model was made available only recently in Aspose.Word 3.0.
Please see the programmers guide for more info http://www.aspose.com/Wiki/default.aspx/Aspose.Word/Home%20Page.html
So Document.Save returns purely raw text, right? Then I would rather opt for Range.Text and remove all unwanted characters in code.
You might need to implement some logic to remove field codes like the ones you mentioned with hyperlinks (you might try to use regular expressions for that). Another way is to implement a DocumentVisitor that will allow extracting the text you want. In fact, Aspose.Word uses DocumentVisitor itself to implement save in DOC, HTML and PDF formats - export to any format or to any system can be done this way.
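As a rough illustration of the regular expression idea, here is a sketch in plain Python that works on the string returned by Range.Text. It assumes the control characters are 0x13, 0x14 and 0x15, that the field code is the English word HYPERLINK followed by a target in straight double quotes, and that the hyperlink field has no nested fields. The pattern and function names are hypothetical.

```python
import re

# \x13 = field start, \x14 = field separator, \x15 = field end (assumed values).
HYPERLINK_FIELD = re.compile(
    r'\x13\s*HYPERLINK\s+"([^"]*)"[^\x13\x14\x15]*\x14([^\x13\x14\x15]*)\x15'
)


def extract_hyperlinks(text):
    """Return a list of (target, displayed_text) pairs."""
    return [(m.group(1), m.group(2)) for m in HYPERLINK_FIELD.finditer(text)]


def replace_hyperlinks(text, keep="text"):
    """Replace each hyperlink field with its displayed text or its target."""
    group = 2 if keep == "text" else 1
    return HYPERLINK_FIELD.sub(lambda m: m.group(group), text)


sample = 'Mail: \x13 HYPERLINK "mailto:abc@hotmail" \x14abc@hotmail\x15 end'
print(extract_hyperlinks(sample))
print(replace_hyperlinks(sample))
```

A pattern like this will miss field codes written in other languages and any field that contains another field, so a stack-based scan or a DocumentVisitor is the more robust route.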
Yup. I will use ControlChar.FieldStart to match and remove the hyperlink and keep only the display text. That should be easy. And I will use CellChar to break down the table too. Thanks for the reminder.
One more question: we can save to HTML. Can we read from HTML too?
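A minimal sketch of the table breakdown in plain Python, working on a string taken from Range.Text. It assumes the cell character has code 7, that every cell ends with one such character, and that a row ends with one more (the "2 dot chars" mentioned earlier). The function name is made up, and empty cells make the doubled character ambiguous, so this only works for tables where every cell has text.

```python
CELL = "\x07"  # assumed value of the Cell control character


def split_table(text):
    """Split table text into rows of cells, assuming no cell is empty."""
    rows = []
    for row in text.split(CELL + CELL):
        if row:  # skip the empty piece after the last row marker
            rows.append(row.split(CELL))
    return rows


sample = "Name\x07Phone\x07\x07Ann\x07555\x07\x07"
print(split_table(sample))
```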
Sure, just pass the name of the HTML file to the Document constructor. Another way of inserting HTML is using the DocumentBuilder.InsertHtml method.
However, not all the HTML elements are fully supported. Please refer to our converters information spreadsheet, see the HTML.Import sheet:
Then we don't have to spend money on a separate HTML file reader, and we can process all the HTML resumes we receive too. As for the error I reported in http://www.aspose.com/Forums/ShowPost.aspx?PostID=25936#25936, where many HTML files were saved as Word files and Aspose threw an error: I renamed them with an .html extension and processed them in Aspose. The result is similar to what I get from reading a Word file. I will look into the details of the control chars and the differences.
Thanks for your help
I get the following from Range.Text. May I know how to interpret it?
INCLUDEPICTURE “cid:image001.gif@01C48611.664C9F00” * MERGEFORMATINET
I understand the field start, separator and end control chars, but what is that “*”? “INCLUDEPICTURE” is the field name. Is “MERGEFORMAT” another field name?
<word.SEITE *Arabisch |4>
<word.SEITE *Arabisch |1><word.SEITE *Arabisch |1>
<word.HYPERLINK "mailto:email@example.com” |firstname.lastname@example.org>a
<word.EINBETTEN PBrush |>
What is the field above? It seems to be an exclamation mark in the original document. Is it a field too? It didn't begin with the field start special character.
EINBETTEN seems like German to me. Does it mean "embed"? And SEITE means "page". Does this mean my users are using a German version of Word?
I thought you were going to just skip everything from field start to field separator. I don’t think that parsing all MS Word field codes is something that you want to do. There are many fields and many flags. But just for your info, “* MERGEFORMAT” is a flag that can be added to many MS Word fields; it means “preserve formatting during updates” in MS Word.
You are quite right, the field code can be in different languages depending on MS Word version used to create the document. If one needs to determine the field type for sure, it is best to look at FieldStart.FieldType property.
First off I’d like to say that Aspose has really come out with a great product and your support is really good, too.
I think our only gripe in this particular case is that we have built a regression testing system that relies on the index of chars within Word docs. And we have spent a considerable amount of time doing data entry for about 1400+ docs.
When we use the new Range.Text, the additional field code chars and field info throw the indexes off. When we use Document.Save, the previously available CellChar chars go missing and that also throws the indexes off. Either way we get shot in the foot.
I don’t suppose Aspose will be changing the Range.Text behaviour back to what it originally was. However, I do really hope that this kind of behaviour change won’t happen often, or that it will at least be announced beforehand.
Again, thanks for a great product.
I have just changed from version 3.0.3 to 3.1.2. The result returned from Range.Text for Word fields has changed again.
I traced it and found that an SOH character (ASCII 1) is missing in the new version:
E-mail Address : DC3 HYPERLINK “mailto:abc@hotmail” SOH DC4 abc@hotmailNAK
E-mail Address : DC3 HYPERLINK “mailto:abc@hotmail” DC4 abc@hotmailNAK
We found that sometimes hyperlink fields contain a \x0001 character, but sometimes they don’t. So in a move to make the model and nodes appear friendlier and more consistent to the user, we made sure \x0001 inside a hyperlink is never returned to the user. Sorry this was not mentioned in the release notes.
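If text captured with an older version needs to be compared against the newer output, one option is to normalize the old text the same way. The sketch below is plain Python and not part of Aspose.Word. It assumes the control characters are 0x13, 0x14 and 0x15, and it drops \x01 inside any field code rather than only inside hyperlinks, which is a simplification. Characters with code 1 outside field codes are left alone because they may stand for something else, such as a picture.

```python
FIELD_START, FIELD_SEPARATOR, FIELD_END, SOH = "\x13", "\x14", "\x15", "\x01"


def drop_soh_in_field_codes(text):
    """Remove \\x01 characters that sit inside a field code."""
    out = []
    in_code = []  # one flag per open field: True while inside its field code
    for ch in text:
        if ch == FIELD_START:
            in_code.append(True)
        elif ch == FIELD_SEPARATOR and in_code:
            in_code[-1] = False
        elif ch == FIELD_END and in_code:
            in_code.pop()
        if ch == SOH and in_code and in_code[-1]:
            continue  # skip SOH found in the innermost field's code
        out.append(ch)
    return "".join(out)


old = 'E-mail Address : \x13 HYPERLINK "mailto:abc@hotmail" \x01\x14abc@hotmail\x15'
new = 'E-mail Address : \x13 HYPERLINK "mailto:abc@hotmail" \x14abc@hotmail\x15'
print(drop_soh_in_field_codes(old) == new)
```

Each removed character still shifts later indexes by one, so any stored offsets would have to be adjusted in the same pass.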
Will this change again in the future?
We don’t have plans to change this one, but if a need arises to do so in the future, we reserve the right to do it.
Aspose.Word is still a young product and has to grow and while we try to keep it backwards compatible (we have over 800 unit tests that help to maintain backward compatibility somewhat) we sometimes just have to make breaking changes.
As a suggestion, you don’t have to upgrade with every release or hotfix, your subscription is valid for one year and you can upgrade less frequently. If you are happy with the current feature set, it might help to spend less time maintaining the solution.
Thanks for understanding.