Re: BOMs

"t.p." <daedulus@xxxxxxxxxxxxx> · Tue, 19 Nov 2013 10:10:47 +0000

----- Original Message -----
From: "Martin J. Dürst" <duerst@xxxxxxxxxxxxxxx>
To: "Henry S. Thompson" <ht@xxxxxxxxxxxx>
Cc: "John Cowan" <cowan@xxxxxxxxxxxxxxxx>; "IETF Discussion"
<ietf@xxxxxxxx>; "Pete Cordell" <petejson@xxxxxxxxxxxxx>; "JSON WG"
<json@xxxxxxxx>; "Anne van Kesteren" <annevk@xxxxxxxxx>;
<www-tag@xxxxxx>; "es-discuss" <es-discuss@xxxxxxxxxxx>
Sent: Monday, November 18, 2013 11:26 AM

> On 2013/11/18 20:11, Henry S. Thompson wrote:
> > Pete Cordell writes:
> >
> >> Given the history below, would it be sensible to accept BOMs for UTF-8
> >> encoding, but not for UTF-16 and UTF-32?  In other words, are BOMs needed
> >> and/or used in the wild for UTF-16 and UTF-32?
> >>
> >> Maybe the text can say something like "SHOULD accept BOMs for UTF-8,
> >> and MAY accept BOMs for UTF-16 and / or UTF-32"?
> >
> > My sense is that you'll see more UTF-16 BOMs than anything else.
>
> Yes indeed. BOM means Byte Order Mark. It's crucial for over-the-wire
> UTF-16. (It's irrelevant for in-memory UTF-16, but that's not what we
> are discussing.) To bring up the XML example again, XML actually
> strictly requires a BOM for UTF-16. The IETF definition of UTF-16 does
> not require a BOM for UTF-16. See http://tools.ietf.org/html/rfc2781, in
> particular http://tools.ietf.org/html/rfc2781#section-3.2,
> http://tools.ietf.org/html/rfc2781#section-3.3, and
> http://tools.ietf.org/html/rfc2781#section-4.
>
> For UTF-8, the BOM is not a Byte Order Mark, because such a mark isn't
> necessary at all. It may serve as a signature, but is not necessary, and
> in some circumstances counterproductive.
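
To make the UTF-16 versus UTF-8 distinction above concrete, here is a minimal Python sketch. It uses only the standard codecs module; the variable names are illustrative and not taken from any of the specifications under discussion.

```python
import codecs

text = "A"

# The same character serialised in both UTF-16 byte orders, each with a BOM.
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 41 00
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF 00 41

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops it.
from_le = le.decode("utf-16")
from_be = be.decode("utf-16")

# Without a BOM the same two bytes are ambiguous: the reader has to be told
# the byte order, and a wrong guess silently yields a different character.
raw = b"\x41\x00"
as_le = raw.decode("utf-16-le")  # "A"
as_be = raw.decode("utf-16-be")  # U+4100, not "A"

# UTF-8 has only one byte order, so EF BB BF carries no ordering information.
# It acts purely as a signature, and a plain UTF-8 decoder passes it through
# as a leading U+FEFF unless the caller strips it.
sig = codecs.BOM_UTF8 + text.encode("utf-8")
kept = sig.decode("utf-8")          # "\ufeffA"
stripped = sig.decode("utf-8-sig")  # "A"
```

The leftover U+FEFF in the plain UTF-8 case is one way a signature can become counterproductive: a consumer that does not expect it sees an extra character at the start of the text.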

Martin

We had a similar discussion about syslog back in 2005. UTF-8 was then
new and different, and the issue was how to tell whether it was being
used or not. What made it into RFC 5424 was
"  If a syslog application encodes MSG in UTF-8, the string MUST start
   with the Unicode byte order mask (BOM), which for UTF-8 is ABNF
   %xEF.BB.BF.  "
which remains a MUST to this day.  There are no relevant Errata.
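
As a rough illustration of that rule, here is a sketch in Python of a sender prefixing the BOM and a receiver testing for it. The function names are hypothetical, and the ASCII fallback for a MSG without a BOM is an assumption made for this example, not something taken from the RFC text quoted above.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def encode_msg(text: str) -> bytes:
    """Sender side: mark the MSG as UTF-8 by prefixing the BOM."""
    return BOM + text.encode("utf-8")

def decode_msg(msg: bytes) -> str:
    """Receiver side: a leading BOM is taken as a declaration of UTF-8."""
    if msg.startswith(BOM):
        return msg[len(BOM):].decode("utf-8")
    # No BOM, so the encoding is not declared. Falling back to ASCII with
    # replacement characters is an assumption of this sketch only.
    return msg.decode("ascii", errors="replace")

wire = encode_msg("caf\u00e9 down")
```

Here the BOM is doing the job of a signature, which is exactly the role being debated for JSON.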

Tom Petch

> As for what to say about whether to accept BOMs or not, I'd really want
> to know what the various existing parsers do. If they accept BOMs, then
> we can say they should accept BOMs. If they don't accept BOMs, then we
> should say that they don't.
>
> Regards,   Martin.
>
> > UTF-32 support seems to be waning (at least in the browsers), but
> > UTF-16 is in pretty widespread use.  John, do you think you can fool
> > google into counting BOMs for us?
>
>
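
For reference, "accepting BOMs" on the receiving side could look something like the Python sketch below. It is one possible lenient reader, not a description of what any existing parser does; decode_lenient, loads_lenient and the UTF-8 default for unmarked input are all assumptions of this example.

```python
import codecs
import json

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE
# BOM (FF FE), so the UTF-32 marks are tested first. This is safe for JSON
# because a JSON text cannot begin with U+0000.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_lenient(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode with the encoding it names."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM: assume the default encoding (an assumption of this sketch).
    return data.decode(default)

def loads_lenient(data: bytes):
    """Parse JSON bytes, tolerating a BOM in front of the text."""
    return json.loads(decode_lenient(data))

doc = '{"a": 1}'
plain = loads_lenient(doc.encode("utf-8"))
with_sig = loads_lenient(codecs.BOM_UTF8 + doc.encode("utf-8"))
utf16 = loads_lenient(codecs.BOM_UTF16_LE + doc.encode("utf-16-le"))
utf32 = loads_lenient(codecs.BOM_UTF32_BE + doc.encode("utf-32-be"))
```

A strict reader would instead reject any input that starts with one of these marks; the table of BOMs is the same either way.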