On 2017-09-18 19:50, Adam Roach wrote:
On 9/18/17 1:43 AM, Carsten Bormann wrote:
The problems caused by BOM pollution were already well understood at the time when the various standards around UTF-8 were written. ... RFC 3629 has a whole section denouncing it.
You refer to section 6? <https://tools.ietf.org/html/rfc3629#section-6>
Denunciation seems like a pretty severe mischaracterization. I think the
verb you were looking for is "endorsing."
Let's look at the actual guidance (ignoring for the moment that it's
talking about protocol design rather than archival documents):
o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.
Well, *that's* not us, since we demonstrably have a mix of ASCII and
UTF-8 documents. (And, on the protocol level, servers that serve up RFCs
can certainly have a mix of these and other encodings.)
Well, any text file encoded in US-ASCII by definition is also encoded in
UTF-8.
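A quick way to convince yourself of that (a minimal sketch in Python;
the sample string is arbitrary):

    # US-ASCII and UTF-8 agree byte-for-byte on every code point
    # below U+0080, so an ASCII file is already a valid UTF-8 file.
    text = "Network Working Group"
    assert text.encode("ascii") == text.encode("utf-8")
    assert text.encode("ascii").decode("utf-8") == text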
I believe a better argument is that we do not control the tools that
people use to read plain text RFCs, and that we found that some
important ones simply work better with the BOM.
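For tools we do control, the usual defensive move is to accept input
with or without the signature. A minimal sketch in Python (the file
name is hypothetical):

    # The "utf-8-sig" codec strips a leading U+FEFF if one is present
    # and otherwise behaves exactly like plain "utf-8", so the same
    # code reads both BOM and BOM-less text files.
    with open("rfc-sample.txt", encoding="utf-8-sig") as f:
        body = f.read()
    assert not body.startswith("\ufeff")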
o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
the protocol elements are maintained tightly under the control of
the implementation from the time of their creation to the time of
their (properly labeled) transmission.
Given that these things are replicated all over the place and metadata
(such as character encoding) is typically not part of that replication,
this also seems inapplicable.
Right.
o A protocol SHOULD NOT forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol does not
provide character encoding identification mechanisms, when a ban
would be unenforceable, or when it is expected that
implementations of the protocol will not be in a position to
always use the mechanisms properly. The latter two cases are
likely to occur with larger protocol elements such as MIME
entities, especially when implementations of the protocol will
obtain such entities from file systems, from protocols that do not
have encoding identification mechanisms for payloads (such as FTP)
or from other protocols that do not guarantee proper
identification of character encoding (such as HTTP).
Oh. OH! There it is. It even calls out FTP and HTTP as the poster
children for "SHOULD NOT forbid", and I suspect that we're well past 99%
of all RFC access being over those two protocols (with the vast majority
being HTTP).
We did a lot of hashing over this issue on the rfc-design team (and I
even started out where you are now, advocating against a BOM). While I
concede that there is no perfect solution, the "pro" column for using
BOMs ended up being far more compelling than the "con" column.
That's my recollection as well.
Even if we find the philosophical arguments to be a draw, this single
practical argument is what convinced me that not including a BOM would
be a solid recipe for failure:
On 11/3/13 12:08 PM, Dave Thaler wrote:
Reality check...
I just ran a test with two UTF-8 files, one with a BOM and one
without. In case you want to try them yourself, they're at
http://research.microsoft.com/~dthaler/Utf8NoBom.txt
http://research.microsoft.com/~dthaler/Utf8WithBom.txt
Each file includes Latin, Greek, and Cyrillic characters.
I tried opening them with a bunch of utilities and browsers (opening
local files, not using HTTP), and used browsershots.org to get
screenshots of HTTP access across many browsers and platforms. Note
that the HTTP server provides no charset in its Content-Type header,
so it's up to the app to detect the encoding. I just copied the files
to a generic web server, and we may expect others would do the same
with their own I-Ds and RFC mirrors.
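For anyone who wants to recreate such a pair of test files, a sketch
along these lines should do (Python; the sample sentence is an
assumption, not the actual file contents):

    # Write the same Latin/Greek/Cyrillic sample twice: once as bare
    # UTF-8, and once via "utf-8-sig", which prepends the three BOM
    # bytes EF BB BF on output.
    sample = "Latin text, Greek αβγ, Cyrillic абв.\n"
    with open("Utf8NoBom.txt", "w", encoding="utf-8") as f:
        f.write(sample)
    with open("Utf8WithBom.txt", "w", encoding="utf-8-sig") as f:
        f.write(sample)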
Results:
1) Some apps worked fine with both files. These include things like
Notepad, Outlook, Word, File Explorer, and Visual Studio 2012
2) Some apps failed with both files (probably written to be ASCII
only). These include things like WinDiff, stevie (a vi clone),
TextPad, the Links browser (on Ubuntu), and the Konqueror browser
(on Ubuntu)
3) Everything else, including almost all browsers, only displayed the
file correctly with the BOM
This included:
* Windows apps: WordPad
* Windows using local files (no HTTP): IE, Firefox, Chrome
* Windows using HTTP: IE, Firefox, Chrome, Navigator
* Mac OS X: Safari, Camino
* Debian: Opera, Dillo
* Ubuntu: Luakit, Iceape
Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing
tools and browsers today, any UTF-8 text format needs to include a BOM.
And yes, that's a good summary from back then.
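For what it's worth, whether a particular copy carries the signature
is easy to check: the BOM is the three-byte sequence EF BB BF at the
start of the file. A minimal sketch in Python (the file name is just
an example):

    import codecs

    # codecs.BOM_UTF8 is exactly b"\xef\xbb\xbf", the UTF-8 encoding
    # of U+FEFF; a file "has a BOM" iff it starts with those bytes.
    with open("rfc-sample.txt", "rb") as f:
        has_bom = f.read(3) == codecs.BOM_UTF8
    print("BOM present" if has_bom else "no BOM")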
Best regards, Julian