On 9/18/17 1:43 AM, Carsten Bormann wrote:
The problems caused by BOM pollution were already well understood at the time when the various standards around UTF-8 were written. ... RFC 3629 has a whole section denouncing it.
You refer to section 6?
<https://tools.ietf.org/html/rfc3629#section-6>
Denunciation seems like a pretty severe mischaracterization. I
think the verb you were looking for is "endorsing."
Let's look at the actual guidance (ignoring for the moment that
it's talking about protocol design rather than archival
documents):
o  A protocol SHOULD forbid use of U+FEFF as a signature for those
   textual protocol elements that the protocol mandates to be always
   UTF-8, the signature function being totally useless in those
   cases.
Well, *that's* not us, since we demonstrably have a mix of ASCII
and UTF-8 documents. (And, on the protocol level, servers that
serve up RFCs can certainly have a mix of these and other
encodings.)
o  A protocol SHOULD also forbid use of U+FEFF as a signature for
   those textual protocol elements for which the protocol provides
   character encoding identification mechanisms, when it is expected
   that implementations of the protocol will be in a position to
   always use the mechanisms properly. This will be the case when
   the protocol elements are maintained tightly under the control of
   the implementation from the time of their creation to the time of
   their (properly labeled) transmission.
Given that these things are replicated all over the place and
metadata (such as character encoding) are typically not part of
that replication, this also seems inapplicable.
o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
   those textual protocol elements for which the protocol does not
   provide character encoding identification mechanisms, when a ban
   would be unenforceable, or when it is expected that
   implementations of the protocol will not be in a position to
   always use the mechanisms properly. The latter two cases are
   likely to occur with larger protocol elements such as MIME
   entities, especially when implementations of the protocol will
   obtain such entities from file systems, from protocols that do not
   have encoding identification mechanisms for payloads (such as FTP)
   or from other protocols that do not guarantee proper
   identification of character encoding (such as HTTP).
Oh. OH! There it is. It even calls out FTP and HTTP as the poster
children for "SHOULD NOT forbid", and I suspect that we're well
past 99% of all RFC access being over those two protocols (with
the vast majority being HTTP).
We did a lot of hashing over of this issue on the rfc-design team
(and I even started out where you are now, advocating against a
BOM). While I concede that there is no perfect solution, the "pro"
column for using BOMs ended up being far more compelling than the
"con" column.
Even if we find the philosophical arguments to be a draw, this
single practical argument is what convinced me that not including
a BOM would be a solid request for failure:
On 11/3/13 12:08 PM, Dave Thaler wrote:
Reality check...
I just ran a test with two UTF-8 files, one with a BOM and one
without. In case you want to try them yourself, they're at
http://research.microsoft.com/~dthaler/Utf8NoBom.txt
http://research.microsoft.com/~dthaler/Utf8WithBom.txt
It includes Latin, Greek, and Cyrillic.
I tried opening them with a bunch of utilities, and browsers
(opening local files not using HTTP), and used browsershots.org
to get screenshots of HTTP access across many browsers and
platforms. Note the HTTP server provides no content encoding
headers so it's up to the app to detect. I just copied the files
to a generic web server, and we may expect others would do the
same with their own I-Ds and RFC mirrors.
Results:
1) Some apps worked fine with both files. These include things
like notepad, outlook, Word, file explorer, Visual Studio 2012
2) Some apps failed with both files (probably written to be
ASCII only). These include things like Windiff, stevie (a vi
clone), textpad, the Links browser (on Ubuntu), and the
Konqueror browser (on Ubuntu)
3) Everything else, including almost all browsers, only
displayed the file correctly with the BOM
This included:
- Windows apps: Wordpad
- Windows using local files (no HTTP): IE, Firefox, Chrome
- Windows using HTTP: IE, Firefox, Chrome, Navigator
- Mac OSX: Safari, Camino
- Debian: Opera, Dillo
- Ubuntu: Luakit, Iceape
Conclusion: If we want people to use UTF-8 RFCs and I-Ds with
existing tools and browsers today, any UTF-8 text format needs
to include a BOM.
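
The behavior in category 3 can be illustrated with a minimal Python sketch. It assumes a reader that trusts the UTF-8 signature when present and otherwise guesses a legacy encoding; the function name `decode_with_bom_sniffing` and the `cp1252` fallback are hypothetical choices for illustration, not a description of how any of the tested applications is actually implemented.

```python
import codecs


def decode_with_bom_sniffing(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 when the UTF-8 signature is present, else guess `fallback`.

    The fallback encoding is an assumption standing in for whatever legacy
    default a given tool might pick when no encoding label is available.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Strip the three-byte signature (EF BB BF) and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature and no label: the guess garbles any non-ASCII text.
    return data.decode(fallback, errors="replace")


sample = "Latin é, Greek λ, Cyrillic Ж"
with_bom = sample.encode("utf-8-sig")  # the "utf-8-sig" codec prepends the BOM
without_bom = sample.encode("utf-8")   # same text, no signature

print(decode_with_bom_sniffing(with_bom) == sample)     # True
print(decode_with_bom_sniffing(without_bom) == sample)  # False
```

Pure ASCII content decodes identically on either path, which is why the difference only shows up once a document contains non-ASCII characters.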
/a