Extracting text - problems with single quotes


#1

Part of our app involves extracting text from Word documents… However, when using Aspose.Word to read such documents, all ’ characters in the documents are returned as ‘squares’ (e.g. Defendant’s comes out as Defendanta??s).

Any ideas?


#2

I’ve been playing around with it a bit more… I think the problem lies in the Aspose.Word.SaveFormat.FormatText

I get the squares when I use this method to convert the binary stream to text…
Dim b() As Byte = strm.ToArray
Dim sb As New System.Text.StringBuilder
For i As Integer = 0 To b.Length - 1
    sb.Append(ChrW(b(i)))
Next

But if I go straight to system.encoding…
Dim sb As String = System.Text.ASCIIEncoding.ASCII.GetChars(strm.ToArray)

I get ’ interpreted as ??? i.e. Defendant???s


#3

Interestingly - it doesn’t happen in ALL documents…

So I have emailed you two examples: one that works and one that doesn’t.


#4

Hi Jat,

I’ll check what encoding is used when we save in text format. We may need to provide a way for the user to control it.

It probably does not happen in all documents because the ’ char is sometimes the plain ASCII apostrophe (in the 0-127 range), but sometimes it is a Unicode character with a different value.

Do you have this problem only with the ’ char, or have you noticed it with anything else?

Apparently you want the ’ char to output correctly in the text document, is that what you are after?


#5

Hi Roman,

Only with the ’ char, and I want the ‘string’ that I build via extracting text from the document to correctly render the '

thanks

jat


#6

Hi Jat,

I have no problems with the file. This is what I get: “Defendant’s request”, so it’s all okay.

Please make sure you use the latest hotfix since we did some work on encodings and handling of special symbols last month.

If you are already using the latest version, then please give more details about what you actually do. Why did you end up using a binary reader and a text builder?

I save to text file, open in Notepad and it looks okay to me. Then I also do:
using (StreamReader reader = System.IO.File.OpenText(@"broken out.txt"))
{
    string s = reader.ReadToEnd();
}

and apostrophes in s are okay too.


#7

Roman,

I am using the latest version/hotfix.

There is no need for me to save to a file, and I don’t understand what you mean about saving it to a file and then opening it again with a streamreader, as that is not required.

Basically what I have is the facility for a user to upload a Word document, from which I extract a textual representation of that document. That is, the person would upload ‘broken.doc’ and Aspose would open the doc and extract the text contained within it.

Below is the method we use to do that, where s_Path is the path to the file the user has uploaded. What it should return is a textual representation of the doc. It does, in fact, do that but the 's become corrupt.


On a side point, our clients want tables to be preserved in the text output. Is it possible to have the table cells in a Word doc map to ‘tabbed columns’? What I mean is:

In ms word

---------------------------------
| Table header  | Header Col 2  |
---------------------------------
| Some data     | Some more     |
| Row 2 col 1   | Some morez    |
---------------------------------

would come out in text (SaveFormat.FormatText) as

Table header    Header Col 2
Some data       Some more
Row 2 col 1     Some morez


Thanks for your help

jat



Private Function ExtractWordTextAspose(ByVal s_Path As String) As String

    Try
        'startup the word dll
        Dim word As Aspose.Word.Word = New Aspose.Word.Word
        word.SetLicense(System.Configuration.ConfigurationSettings.AppSettings.Item("WordLic").ToString, Page)

        'load the file
        Dim mainDoc As Aspose.Word.Document
        mainDoc = word.Open(s_Path)

        'work with the document (mailmerge nothing)
        Dim mergeDoc = mainDoc.MailMerge.Execute(New String() {}, New Object() {})

        'create a memory stream
        Dim strm As System.IO.MemoryStream = New System.IO.MemoryStream(0)
        'write the document text to that stream
        mergeDoc.Save(strm, Aspose.Word.SaveFormat.FormatText)

        'create the string
        Dim sb As String = System.Text.ASCIIEncoding.ASCII.GetChars(strm.ToArray)

        'delete the file
        Kill(s_Path)

        'return the string
        Return sb
    Catch
        Throw New Exception("Could not extract text from the document. Please check the file.")
    End Try

End Function


#8

Hi Jat,

Thanks, it is clear now.

There are a number of apostrophe-looking characters around. You can look them up in Word under Insert/Symbol. Here are just a few:

0x27 - Apostrophe
0x60 - Grave Accent
0xb4 - Acute Accent
0x2019 - Right Single Quotation Mark

Your document contains the 0x2019 character (which is outside the ASCII range, so it can only be represented as a Unicode character) and you are using ASCII encoding to read it, hence you get the garbage.

There are two things you can do:

Change the character in the document so it is just an apostrophe (0x27).

or

Use a different encoding to read the string. I looked into Aspose.Word and it creates a StreamWriter with the default encoding, which happens to be UTF-8, so if you change your code to:

string myString = System.Text.Encoding.UTF8.GetString(stream.ToArray());

It all works fine.
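
To make the difference between the three decodings concrete, here is a minimal sketch that does not need Aspose.Word at all. It assumes, as described above, that the saved text is UTF-8; the `saved` byte array is only a stand-in for what `stream.ToArray()` would return, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class ApostropheDecodingDemo
{
    // Mimics the ChrW loop from post #2: every byte becomes its own character.
    public static string DecodeBytePerChar(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Stand-in for the saved text (assumption: UTF-8 output).
        // U+2019 is encoded as the three bytes E2 80 99.
        byte[] saved = Encoding.UTF8.GetBytes("Defendant\u2019s");

        string perByte = DecodeBytePerChar(saved);        // three junk characters in place of the quote
        string ascii = Encoding.ASCII.GetString(saved);   // each non-ASCII byte becomes '?'
        string utf8 = Encoding.UTF8.GetString(saved);     // the quote is restored

        Console.WriteLine(saved.Length);                      // 13
        Console.WriteLine(perByte.Length);                    // 13
        Console.WriteLine(ascii);                             // Defendant???s
        Console.WriteLine(utf8 == "Defendant\u2019s");        // True
    }
}
```

The byte-per-character loop yields one character for each of the three bytes (two of them unprintable control characters, which would explain the squares), while ASCII decoding replaces each of the three bytes with a question mark.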

I’m open to discussing which encoding Aspose.Word should save in; it should probably be made controllable by the user.

Also, at the moment Aspose.Word closes the stream after it has finished writing to it, but I think I will change it so that it is the client’s responsibility to close the stream. If the client opens the stream, the client should close it, not Aspose.Word. This will make it possible to read the string from the memory stream without converting it into a byte array.

It will be possible to do this:

MemoryStream stream = new MemoryStream();
doc.Save(stream, SaveFormat.FormatText);

stream.Position = 0;
StreamReader reader = new StreamReader(stream);
string myString = reader.ReadToEnd();

Which simplifies your code a little. This is not critical and will be in the next release.
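
Until then, a rough sketch of both situations, again without Aspose.Word: `SaveTextAndClose` is a hypothetical stand-in for a save call that writes UTF-8 text and closes the stream, which is an assumption based on the behaviour described above. It relies on the fact that `MemoryStream.ToArray` still works after the stream has been closed.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReadDemo
{
    // Hypothetical stand-in for a save call that closes the stream when done.
    // A StreamWriter created without an explicit encoding writes UTF-8.
    public static void SaveTextAndClose(Stream stream, string text)
    {
        using (var writer = new StreamWriter(stream))
        {
            writer.Write(text);
        } // disposing the writer also closes the underlying stream
    }

    // Works on a closed stream: ToArray is still allowed after Close.
    public static string ReadClosedStream(MemoryStream stream)
    {
        return Encoding.UTF8.GetString(stream.ToArray());
    }

    // Works only while the stream is still open: rewind, then read as UTF-8.
    public static string ReadOpenStream(MemoryStream stream)
    {
        stream.Position = 0;
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        var closed = new MemoryStream();
        SaveTextAndClose(closed, "Defendant\u2019s request");
        Console.WriteLine(ReadClosedStream(closed) == "Defendant\u2019s request"); // True

        var open = new MemoryStream();
        byte[] data = Encoding.UTF8.GetBytes("Defendant\u2019s request");
        open.Write(data, 0, data.Length);
        Console.WriteLine(ReadOpenStream(open) == "Defendant\u2019s request");     // True
    }
}
```

Passing the encoding to the StreamReader explicitly keeps the read side in step with the write side if the output encoding ever becomes configurable.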


#9

Thanks Roman,

Clearly it was an encoding problem, given that ChrW and ASCII gave different representations - I just wasn’t thinking.


#10

Regarding alignment of text into table cells, I don’t think we can help immediately.
This would require some sort of layout engine, which Aspose.Word currently is not.

You could try to cheat and have users create documents whose cells align naturally when rendered as text. No better ideas at this stage, sorry.
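
If the cell texts can be obtained row by row on the client side, a possible workaround is to pad the columns yourself. The sketch below is not an Aspose.Word feature: `FormatColumns` is a hypothetical helper, and how the rows are collected from the document is left open. It only handles single-line cells and assumes the text is shown in a fixed-width font.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TextTableDemo
{
    // rows: one string[] per table row, one entry per cell (hypothetical input).
    // gap: number of spaces between the widest cell of a column and the next column.
    public static string FormatColumns(IList<string[]> rows, int gap)
    {
        // First pass: find the widest cell in each column.
        var widths = new List<int>();
        foreach (string[] row in rows)
        {
            for (int c = 0; c < row.Length; c++)
            {
                if (c == widths.Count) widths.Add(0);
                widths[c] = Math.Max(widths[c], row[c].Length);
            }
        }

        // Second pass: pad every cell except the last one in its row.
        var sb = new StringBuilder();
        foreach (string[] row in rows)
        {
            for (int c = 0; c < row.Length; c++)
            {
                bool last = c == row.Length - 1;
                sb.Append(last ? row[c] : row[c].PadRight(widths[c] + gap));
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var rows = new List<string[]>
        {
            new[] { "Table header", "Header Col 2" },
            new[] { "Some data", "Some more" },
            new[] { "Row 2 col 1", "Some morez" }
        };
        Console.Write(FormatColumns(rows, 4));
    }
}
```

With the sample table from post #7 and a gap of 4, the second column starts at character 16 on every line.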