The Document class garbles characters when the input stream is an HTML file encoded as UTF-8 without the 3-byte byte order mark (BOM). I assume the same problem exists for UTF-16 files without a BOM. The garbling can be reproduced as follows:
- Create a stream for the attached HTML file (encoded as UTF-8 but without a BOM).
- Create a new Document using that stream.
- Call ToTxt() and examine the resulting string.
The original text:
São Paulo – SP – Brazil
is garbled into:
SÃ£o Paulo â€" SP â€" Brazil
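The garbled output is consistent with the UTF-8 bytes being decoded with a single-byte code page such as Windows-1252. That is my assumption about the cause, not something I have confirmed in the Document class itself. A minimal Python sketch of that assumed mechanism (the variable names are only for illustration):

```python
# Assumption: the file's UTF-8 bytes are being decoded as Windows-1252 (cp1252).
original = "São Paulo – SP – Brazil"

raw = original.encode("utf-8")   # the bytes as stored in the BOM-less file
garbled = raw.decode("cp1252")   # what a cp1252 decoder makes of those bytes

print(garbled)

# Undoing the wrong decode recovers the original text, which supports the
# idea that the bytes are intact and only the chosen encoding is wrong.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == original)
```

If this is what happens, the bytes in the file are fine and only the encoding chosen when no BOM is present needs to change.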
You can auto-detect UTF-8 by scanning the bytes to check whether they form valid UTF-8 byte sequences (see http://en.wikipedia.org/wiki/UTF-8 for details). You can auto-detect UTF-16 by using the IsTextUnicode Win32 function.
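As a rough sketch of the UTF-8 check, the Python below uses a strict decode as a stand-in for scanning the byte sequences by hand. The function name `guess_encoding` and the `cp1252` fallback are hypothetical choices for illustration, not part of the Document class or any library API.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a text encoding for data (hypothetical helper, illustration only).

    Returns "utf-8-sig" when a UTF-8 BOM is present, "utf-8" when the bytes
    are well-formed UTF-8 (pure ASCII included), otherwise the fallback.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # A strict decode raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


sample = "São Paulo – SP – Brazil".encode("utf-8")  # UTF-8 without a BOM
print(guess_encoding(sample))
print(sample.decode(guess_encoding(sample)))
```

Text in a legacy single-byte encoding that contains non-ASCII characters very rarely happens to be valid UTF-8, so a successful strict decode is strong evidence. UTF-16 without a BOM is not covered by this sketch.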