Re: Should the IETF be condoning, even promoting, BOM pollution?

On 9/18/17 1:43 AM, Carsten Bormann wrote:
The problems caused by BOM pollution were already well understood at the time when the various standards around UTF-8 were written.  ...  RFC 3629 has a whole section denouncing it.


You refer to section 6? <https://tools.ietf.org/html/rfc3629#section-6> Denunciation seems like a pretty severe mischaracterization. I think the verb you were looking for is "endorsing."

Let's look at the actual guidance (ignoring for the moment that it's talking about protocol design rather than archival documents):


   o  A protocol SHOULD forbid use of U+FEFF as a signature for those
      textual protocol elements that the protocol mandates to be always
      UTF-8, the signature function being totally useless in those
      cases.


Well, *that's* not us, since we demonstrably have a mix of ASCII and UTF-8 documents. (And, at the protocol level, servers that serve up RFCs can certainly have a mix of these and other encodings.)


   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly.  This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labeled) transmission.


Given that these things are replicated all over the place and metadata (such as character encoding) are typically not part of that replication, this also seems inapplicable.


   o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol does not
      provide character encoding identification mechanisms, when a ban
      would be unenforceable, or when it is expected that
      implementations of the protocol will not be in a position to
      always use the mechanisms properly.  The latter two cases are
      likely to occur with larger protocol elements such as MIME
      entities, especially when implementations of the protocol will
      obtain such entities from file systems, from protocols that do not
      have encoding identification mechanisms for payloads (such as FTP)
      or from other protocols that do not guarantee proper
      identification of character encoding (such as HTTP).


Oh. OH! There it is. It even calls out FTP and HTTP as the poster children for "SHOULD NOT forbid", and I suspect that we're well past 99% of all RFC access being over those two protocols (with the vast majority being HTTP).
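For concreteness, the "signature" being discussed is simply the three bytes EF BB BF (U+FEFF encoded in UTF-8) at the front of the stream. Here is a minimal sketch, in Python, of how a consumer can use it when no charset is declared; the fallback chain is my own illustration, not something RFC 3629 prescribes:

    import codecs

    def has_utf8_signature(raw: bytes) -> bool:
        # codecs.BOM_UTF8 is b'\xef\xbb\xbf', i.e. U+FEFF encoded as UTF-8.
        return raw.startswith(codecs.BOM_UTF8)

    def decode_text(raw: bytes) -> str:
        # If the signature is present, 'utf-8-sig' decodes and strips it.
        if has_utf8_signature(raw):
            return raw.decode("utf-8-sig")
        # Otherwise try plain UTF-8, then fall back to Latin-1 as a stand-in
        # for whatever legacy default a given tool happens to apply.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("latin-1")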


We hashed this issue over at length on the rfc-design team (and I even started out where you are now, advocating against a BOM). While I concede that there is no perfect solution, the "pro" column for using BOMs ended up being far more compelling than the "con" column.

Even if we call the philosophical arguments a draw, this single practical argument is what convinced me that omitting the BOM would be a recipe for failure:

On 11/3/13 12:08 PM, Dave Thaler wrote:

Reality check...

I just ran a test with two UTF-8 files, one with a BOM and one without. In case you want to try them yourself, they're at

http://research.microsoft.com/~dthaler/Utf8NoBom.txt

http://research.microsoft.com/~dthaler/Utf8WithBom.txt

Each file includes Latin, Greek, and Cyrillic text.


I tried opening them with a bunch of utilities and browsers (opening the local files, not using HTTP), and used browsershots.org to get screenshots of HTTP access across many browsers and platforms. Note that the HTTP server provides no character-encoding information in its headers, so it's up to the app to detect the encoding. I just copied the files to a generic web server, and we may expect others to do the same with their own I-Ds and RFC mirrors.
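(A minimal sketch in Python of that header check, for anyone who wants to verify it against a copy of the two files; this is illustrative, not part of the original test, and it assumes the URLs above still resolve. Any mirror of the files on a generic web server would do.)

    from urllib.request import urlopen

    # A generic server typically sends no charset parameter in Content-Type
    # for .txt files, leaving encoding detection entirely to the client.
    for url in ("http://research.microsoft.com/~dthaler/Utf8NoBom.txt",
                "http://research.microsoft.com/~dthaler/Utf8WithBom.txt"):
        with urlopen(url) as resp:
            content_type = resp.headers.get("Content-Type", "")
            body = resp.read()
        print(url)
        print("  Content-Type:", content_type)              # typically no "charset=..."
        print("  starts with BOM:", body.startswith(b"\xef\xbb\xbf"))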

Results:

1) Some apps worked fine with both files. These include things like Notepad, Outlook, Word, File Explorer, and Visual Studio 2012.

2) Some apps failed with both files (probably written to be ASCII-only). These include things like WinDiff, stevie (a vi clone), TextPad, the Links browser (on Ubuntu), and the Konqueror browser (on Ubuntu).

3) Everything else, including almost all browsers, displayed the file correctly only with the BOM.

This included:
  • Windows apps: Wordpad
  • Windows using local files (no HTTP): IE, Firefox, Chrome
  • Windows using HTTP: IE, Firefox, Chrome, Navigator
  • Mac OS X: Safari, Camino
  • Debian: Opera, Dillo
  • Ubuntu: Luakit, Iceape

Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing tools and browsers today, any UTF-8 text format needs to include a BOM.
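On the producing side, adding the signature is cheap. A minimal sketch in Python, using the standard 'utf-8-sig' codec (the sample text here is a placeholder, not the content of Dave's test files):

    # Writing with 'utf-8-sig' prepends the EF BB BF signature automatically.
    sample = "Latin, Ελληνικά, Кириллица\n"   # placeholder text
    with open("Utf8WithBom.txt", "w", encoding="utf-8-sig") as f:
        f.write(sample)

    # Reading with 'utf-8-sig' strips the signature if present and also accepts
    # input that has none, so the same code handles BOM and BOM-less files.
    with open("Utf8WithBom.txt", "r", encoding="utf-8-sig") as f:
        assert f.read() == sample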


/a

