----- Original Message -----
From: "Martin J. Dürst" <duerst@xxxxxxxxxxxxxxx>
To: "Henry S. Thompson" <ht@xxxxxxxxxxxx>
Cc: "John Cowan" <cowan@xxxxxxxxxxxxxxxx>; "IETF Discussion" <ietf@xxxxxxxx>; "Pete Cordell" <petejson@xxxxxxxxxxxxx>; "JSON WG" <json@xxxxxxxx>; "Anne van Kesteren" <annevk@xxxxxxxxx>; <www-tag@xxxxxx>; "es-discuss" <es-discuss@xxxxxxxxxxx>
Sent: Monday, November 18, 2013 11:26 AM

> On 2013/11/18 20:11, Henry S. Thompson wrote:
> > Pete Cordell writes:
> >
> >> Given the history below, would it be sensible to accept BOMs for UTF-8
> >> encoding, but not for UTF-16 and UTF-32? In other words, are BOMs needed
> >> and/or used in the wild for UTF-16 and UTF-32?
> >>
> >> Maybe the text can say something like "SHOULD accept BOMs for UTF-8,
> >> and MAY accept BOMs for UTF-16 and / or UTF-32"?
> >
> > My sense is that you'll see more UTF-16 BOMs than anything else.
>
> Yes indeed. BOM means Byte Order Mark. It's crucial for over-the-wire
> UTF-16. (It's irrelevant for in-memory UTF-16, but that's not what we
> are discussing.) To bring up the XML example again, XML actually
> strictly requires a BOM for UTF-16. The IETF definition of UTF-16 does
> not require a BOM for UTF-16. See http://tools.ietf.org/html/rfc2781, in
> particular http://tools.ietf.org/html/rfc2781#section-3.2,
> http://tools.ietf.org/html/rfc2781#section-3.3, and
> http://tools.ietf.org/html/rfc2781#section-4.
>
> For UTF-8, the BOM is not a Byte Order Mark, because such a mark isn't
> necessary at all. It may serve as a signature, but is not necessary, and
> in some circumstances counterproductive.

Martin

We had a similar discussion with syslog back in 2005. The issue was that
UTF-8 was new and different, and the question was how to tell whether it
was being used. What made it into RFC5424 was

"  If a syslog application encodes MSG in UTF-8, the string MUST start
   with the Unicode byte order mask (BOM), which for UTF-8 is ABNF
   %xEF.BB.BF. "

which remains a MUST to this day. There are no relevant Errata.

Tom Petch

> As for what to say about whether to accept BOMs or not, I'd really want
> to know what the various existing parsers do. If they accept BOMs, then
> we can say they should accept BOMs. If they don't accept BOMs, then we
> should say that they don't.
>
> Regards, Martin.
>
> > UTF-32 support seems to be waning (at least in the browsers), but
> > UTF-16 is in pretty widespread use. John, do you think you can fool
> > google into counting BOMs for us?
> >
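
For anyone wanting to check what "accepting a BOM" would mean in practice, here is a minimal sketch in Python, not a description of any existing JSON or syslog parser. The function names sniff_bom and decode_accepting_bom are made up for illustration; the byte sequences come from the standard codecs module. It assumes the input is a complete byte string and that text without a BOM is UTF-8.

```python
import codecs

# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks:
# the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
# Assumption: the text itself never starts with U+0000, which holds for JSON.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),      # EF BB BF, the signature RFC5424 requires
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_accepting_bom(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical lenient decoder: strip a leading BOM if present, then decode."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)
```

Running sniff_bom over a sample of real documents would be one way to gather the kind of counts asked for above, and decode_accepting_bom shows the lenient behaviour that a "SHOULD accept" wording would imply.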