HTML gets corrupted when input stream is UTF encoded without BOM

pkeairnes · October 24, 2007, 1:19pm

The Document class garbles characters when the input stream is an HTML file encoded as UTF-8 without the 3-byte byte order mark (BOM). I assume the same problem exists for UTF-16 files without a BOM. The garbling can be reproduced as follows:

Create a stream for the attached HTML file (encoded as UTF-8 but without a BOM)
Create a new Document using that stream
Call ToTxt() and examine the resulting string

Note that:
São Paulo – SP – Brazil
Is garbled into:
SÃ£o Paulo â€" SP â€" Brazil
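
The garbled output above is consistent with UTF-8 bytes being decoded with a single-byte Western code page. A minimal Python sketch of that effect follows; the choice of Windows-1252 is an assumption (the thread does not say which code page the library fell back to), and nothing here is Aspose API.

```python
# Illustration only: reproduce the kind of garbling shown above.
original = "São Paulo – SP – Brazil"

# The bytes as they would sit in a UTF-8 file without a BOM.
utf8_bytes = original.encode("utf-8")

# Assumption: the reader misinterprets the bytes as Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# The damage is reversible as long as no bytes were dropped on the way.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == original)
```

Each multi-byte UTF-8 sequence turns into two or three unrelated characters, which is why "ã" becomes "Ã£".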

You can auto-detect UTF8 by scanning the bytes to see if they match the UTF8 Character Sequences (see http://en.wikipedia.org/wiki/UTF-8 for details). And you can auto-detect UTF16 by using the IsTextUnicode Win32 function.
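
As a rough sketch of the byte-scanning idea, a strict UTF-8 decode performs the same sequence validation and fails on the first malformed sequence. The function name `looks_like_utf8` is hypothetical, and Python is used only for illustration; this is not how Aspose detects encodings.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is well-formed UTF-8.

    Pure ASCII also passes, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8")  # strict by default: raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


sample_utf8 = "São Paulo – SP – Brazil".encode("utf-8")
sample_cp1252 = "São Paulo".encode("cp1252")

print(looks_like_utf8(sample_utf8))
print(looks_like_utf8(sample_cp1252))
```

Text in a legacy single-byte encoding rarely passes this check once it contains non-ASCII characters, because a lone high byte is almost never followed by the continuation bytes UTF-8 requires.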

Klepus · October 24, 2007, 3:05pm

Hello!
Thank you for your interest in Aspose products and for your expertise.
I have reproduced this issue and logged it as #3955 in our defect database. We’ll let you know when the issue is fixed.
Unicode (UTF-16) without a BOM is also recognized incorrectly. Note that in the current implementation, loading Unicode files requires the document format to be specified explicitly. For instance, loading from a file will look like this:
Document doc = new Document(filename, LoadFormat.Html, "");
Adding the format argument is not hard, but when I left it out I got an “Unsupported file format” exception.
As a workaround, make sure that all input files have a BOM. If all your files are in UTF-8, you can simply prepend the three-byte BOM without analyzing the actual data; just don’t add it if a file already has one. Of course, if the files use different encodings, some analysis of them is needed.
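
A minimal sketch of the "add the BOM unless it is already there" workaround, assuming the input is already known to be UTF-8. The helper name `ensure_utf8_bom` is hypothetical and the example is in Python purely for illustration.

```python
import codecs


def ensure_utf8_bom(data: bytes) -> bytes:
    """Prepend the 3-byte UTF-8 BOM (EF BB BF) unless it is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


html = "<html><body>São Paulo</body></html>".encode("utf-8")
with_bom = ensure_utf8_bom(html)
print(with_bom[:3] == codecs.BOM_UTF8)
```

Calling the helper twice gives the same result as calling it once, so it is safe to run over a mixed set of files with and without a BOM.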
Regards,

pkeairnes · October 24, 2007, 7:48pm

Thanks. I am already specifying LoadFormat.Html as you suggested. I had also already implemented the workaround. It detects the various Unicode encodings and falls back to using the HTML charset/encoding meta tags, then if necessary re-encodes the bytes to UTF8 with BOM before passing to the Document class. It was a pain to build, but it works.
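
The poster's actual code is not shown in the thread. The following Python sketch only illustrates the general shape of such a workaround: check for a BOM, try strict UTF-8, fall back to the HTML meta charset, then re-encode as UTF-8 with a BOM. The function name, the regular expression, the 4096-byte scan window and the cp1252 default are all assumptions, and BOM-less UTF-16 detection is left out.

```python
import codecs
import re

# UTF-32 BOMs come first because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def to_utf8_with_bom(data: bytes, default: str = "cp1252") -> bytes:
    """Decode HTML bytes using simple heuristics and return UTF-8 with a BOM."""
    text = None
    for bom, name in _BOMS:
        if data.startswith(bom):
            text = data[len(bom):].decode(name)
            break
    if text is None:
        try:
            text = data.decode("utf-8")  # strict: raises if not valid UTF-8
        except UnicodeDecodeError:
            match = _META_CHARSET.search(data[:4096])
            name = match.group(1).decode("ascii") if match else default
            try:
                text = data.decode(name)
            except (LookupError, UnicodeDecodeError):
                # Unknown or wrong charset label: fall back to the default.
                text = data.decode(default, errors="replace")
    return codecs.BOM_UTF8 + text.encode("utf-8")


latin1_html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1"></head>'
    "<body>São Paulo</body></html>"
).encode("iso-8859-1")
converted = to_utf8_with_bom(latin1_html)
```

The resulting bytes can then be wrapped in a stream and handed to the loader, which sees an unambiguous UTF-8 BOM regardless of the original encoding.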

Klepus · October 25, 2007, 3:35am

Hi!
I’m sorry you had to work around this issue. We are always improving our products, but of course they are not perfect. If you run into any difficulties, we can help you implement workarounds; that makes sense when fixing the general case is much harder than working around it. Feel free to ask us for help and suggestions.
Have a nice day,

alexey.noskov · July 2, 2008, 8:32am

The issues you have found earlier (filed as 3955) have been fixed in this update.

aspose.notifier · February 7, 2019, 6:00pm

The issues you have found earlier (filed as ) have been fixed in this update.