On 2017-09-18 19:50, Adam Roach wrote:
On 9/18/17 1:43 AM, Carsten Bormann wrote:
The problems caused by BOM pollution were already well understood at the time when the various standards around UTF-8 were written. ... RFC 3629 has a whole section denouncing it.
You refer to section 6? <https://tools.ietf.org/html/rfc3629#section-6>
Denunciation seems like a pretty severe mischaracterization. I think the
verb you were looking for is "endorsing."
Let's look at the actual guidance (ignoring for the moment that it's
talking about protocol design rather than archival documents):
o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.
Well, *that's* not us, since we demonstrably have a mix of ASCII and
UTF-8 documents. (And, on the protocol level, servers that serve up RFCs
can certainly have a mix of these and other encodings.)
Well, any text file encoded in US-ASCII by definition is also encoded in
UTF-8.
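A quick way to convince yourself of that (a minimal sketch in Python;
the sample string is arbitrary):

    # US-ASCII and UTF-8 agree byte-for-byte on every code point
    # below U+0080, so an ASCII file is already a valid UTF-8 file.
    text = "Network Working Group"
    assert text.encode("ascii") == text.encode("utf-8")
    assert text.encode("ascii").decode("utf-8") == text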
I believe a better argument is that we do not control the tools that
people use to read plain text RFCs, and that we found that some
important ones simply work better with the BOM.
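For tools we do control, the usual defensive move is to accept input
with or without the signature. A minimal sketch in Python (the file
name is hypothetical):

    # The "utf-8-sig" codec strips a leading U+FEFF if one is present
    # and otherwise behaves exactly like plain "utf-8", so the same
    # code reads both BOM and BOM-less text files.
    with open("rfc-sample.txt", encoding="utf-8-sig") as f:
        body = f.read()
    assert not body.startswith("\ufeff")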
o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
the protocol elements are maintained tightly under the control of
the implementation from the time of their creation to the time of
their (properly labeled) transmission.
Given that these things are replicated all over the place and metadata
(such as character encoding) is typically not part of that replication,
this also seems inapplicable.
Right.
o A protocol SHOULD NOT forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol does not
provide character encoding identification mechanisms, when a ban
would be unenforceable, or when it is expected that
implementations of the protocol will not be in a position to
always use the mechanisms properly. The latter two cases are
likely to occur with larger protocol elements such as MIME
entities, especially when implementations of the protocol will
obtain such entities from file systems, from protocols that do not
have encoding identification mechanisms for payloads (such as FTP)
or from other protocols that do not guarantee proper
identification of character encoding (such as HTTP).
Oh. OH! There it is. It even calls out FTP and HTTP as the poster
children for "SHOULD NOT forbid", and I suspect that we're well past 99%
of all RFC access being over those two protocols (with the vast majority
being HTTP).
We did a lot of hashing over this issue on the rfc-design team (and I
even started out where you are now, advocating against a BOM). While I
concede that there is no perfect solution, the "pro" column for using
BOMs ended up being far more compelling than the "con" column.
That's my recollection as well.
Even if we find the philosophical arguments to be a draw, this single
practical argument is what convinced me that not including a BOM would
be a solid recipe for failure:
On 11/3/13 12:08 PM, Dave Thaler wrote:
Reality check...
I just ran a test with two UTF-8 files, one with a BOM and one
without. In case you want to try them yourself, they're at
http://research.microsoft.com/~dthaler/Utf8NoBom.txt
http://research.microsoft.com/~dthaler/Utf8WithBom.txt
Each file includes Latin, Greek, and Cyrillic characters.
I tried opening them with a bunch of utilities and browsers (opening
local files, not using HTTP), and used browsershots.org to get
screenshots of HTTP access across many browsers and platforms. Note
that the HTTP server provides no charset in its Content-Type header,
so it's up to the app to detect the encoding. I just copied the files
to a generic web server, and we may expect others would do the same
with their own I-Ds and RFC mirrors.
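For anyone who wants to recreate such a pair of test files, a sketch
along these lines should do (Python; the sample sentence is an
assumption, not the actual file contents):

    # Write the same Latin/Greek/Cyrillic sample twice: once as bare
    # UTF-8, and once via "utf-8-sig", which prepends the three BOM
    # bytes EF BB BF on output.
    sample = "Latin text, Greek αβγ, Cyrillic абв.\n"
    with open("Utf8NoBom.txt", "w", encoding="utf-8") as f:
        f.write(sample)
    with open("Utf8WithBom.txt", "w", encoding="utf-8-sig") as f:
        f.write(sample)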
Results:
1) Some apps worked fine with both files. These include things like
Notepad, Outlook, Word, File Explorer, and Visual Studio 2012
2) Some apps failed with both files (probably written to be ASCII
only). These include things like WinDiff, stevie (a vi clone),
TextPad, the Links browser (on Ubuntu), and the Konqueror browser
(on Ubuntu)
3) Everything else, including almost all browsers, only displayed the
file correctly with the BOM
This included:
* Windows apps: WordPad
* Windows using local files (no HTTP): IE, Firefox, Chrome
* Windows using HTTP: IE, Firefox, Chrome, Navigator
* Mac OS X: Safari, Camino
* Debian: Opera, Dillo
* Ubuntu: Luakit, Iceape
Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing
tools and browsers today, any UTF-8 text format needs to include a BOM.
And yes, that's a good summary from back then.
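For what it's worth, whether a particular copy carries the signature
is easy to check: the BOM is the three-byte sequence EF BB BF at the
start of the file. A minimal sketch in Python (the file name is just
an example):

    import codecs

    # codecs.BOM_UTF8 is exactly b"\xef\xbb\xbf", the UTF-8 encoding
    # of U+FEFF; a file "has a BOM" iff it starts with those bytes.
    with open("rfc-sample.txt", "rb") as f:
        has_bom = f.read(3) == codecs.BOM_UTF8
    print("BOM present" if has_bom else "no BOM")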
Best regards, Julian