I’m using Aspose.Words for Java and am trying to read a large number of documents that contain Greek symbols rendered with the Symbol font.
When I read these documents, I get strange values for those symbols, for example:
small alpha = '\uf061'
small beta = '\uf062'
small gamma = '\uf067'
The results are neither the Unicode Greek letters nor strings encoded in the Symbol charset.
So, how should I interpret these results?
Can you provide the document so that I can reproduce the problem? You can truncate the document and leave only a few Greek symbols in it.
Sure! I have attached a zip file that contains a basic test file.
1. MS Word really saves _these_ values. Aspose.Words just displays them.
2. These values are Unicode; they lie in the Private Use Area.
Since Word 97, MS Word translates all ASCII keystrokes to Unicode using the current code page. With symbol fonts, this translation is a little more complicated (or simplified?). Here is a short quote:
“When a symbol font is selected, text that is entered is translated to Unicode by a different translation than that provided by the current codepage. Since a symbol font could potentially contain any character or graphic symbol, there is no way to know if the character being entered is defined in Unicode or not. As a result, Word will translate the 8-bit value entered to a special range in Unicode known as the “Private Use Area”. Specifically, Word will translate the 8-bit value by adding 0xF000”
You can use Greek chars from non-symbol fonts to get more consistent results.
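If you need to turn the stored values back into ordinary Greek letters, the translation described in the quote can be reversed. The following is only a minimal sketch in plain Java, not part of Aspose.Words: the class name `SymbolPuaDecoder` and its methods are hypothetical, and the lookup table covers just the three letters from this thread. It also assumes you already know the text is formatted with the Symbol font, because other symbol fonts (Wingdings, for example) use the same 0xF000 range with different meanings.

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolPuaDecoder {
    // Partial, illustrative table: Symbol font 8-bit code -> Unicode character.
    // Only the three letters discussed in this thread are listed.
    private static final Map<Integer, Character> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, '\u03B1'); // small alpha
        SYMBOL_TO_UNICODE.put(0x62, '\u03B2'); // small beta
        SYMBOL_TO_UNICODE.put(0x67, '\u03B3'); // small gamma
    }

    // True if the character lies in the range Word uses for symbol fonts.
    static boolean isShiftedSymbol(char c) {
        return c >= 0xF000 && c <= 0xF0FF;
    }

    // Undo the shift: subtract 0xF000 to recover the original 8-bit value.
    static int toSymbolCode(char c) {
        return c - 0xF000;
    }

    // Replace shifted characters that have a known mapping; keep everything else.
    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isShiftedSymbol(c)) {
                Character mapped = SYMBOL_TO_UNICODE.get(toSymbolCode(c));
                out.append(mapped != null ? mapped : c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "\uf061\uf062\uf067"; // values as read from the document
        System.out.println(decode(raw));
    }
}
```

Characters without an entry in the table are left unchanged, so a fuller Symbol-to-Unicode table can be added later without touching the rest of the code.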
Okay, so the characters are encoded in the Symbol code page and simply shifted by 0xF000.
That explains it, thank you.