Re: Should the IETF be condoning, even promoting, BOM pollution?

Adam Roach <adam@xxxxxxxxxxx> · Mon, 18 Sep 2017 12:50:53 -0500

    On 9/18/17 1:43 AM, Carsten Bormann wrote:

      The problems caused by BOM pollution were already well understood at the time when the various standards around UTF-8 were written.  ...  RFC 3629 has a whole section denouncing it.

    You refer to section 6?
      <https://tools.ietf.org/html/rfc3629#section-6>
      Denunciation seems like a pretty severe mischaracterization. I
      think the verb you were looking for is "endorsing."

    Let's look at the actual guidance (ignoring for the moment that
      it's talking about protocol design rather than archival
      documents):

           o  A protocol SHOULD forbid use of U+FEFF as a signature for those
              textual protocol elements that the protocol mandates to be always
              UTF-8, the signature function being totally useless in those
              cases.

    Well, *that's* not us, since we demonstrably have a mix of ASCII
      and UTF-8 documents. (And, on the protocol level, servers that
      serve up RFCs can certainly have a mix of these and other
      encodings.)

           o  A protocol SHOULD also forbid use of U+FEFF as a signature for
              those textual protocol elements for which the protocol provides
              character encoding identification mechanisms, when it is expected
              that implementations of the protocol will be in a position to
              always use the mechanisms properly.  This will be the case when
              the protocol elements are maintained tightly under the control of
              the implementation from the time of their creation to the time of
              their (properly labeled) transmission.

    Given that these things are replicated all over the place and
      metadata (such as character encoding) are typically not part of
      that replication, this also seems inapplicable.

           o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
              those textual protocol elements for which the protocol does not
              provide character encoding identification mechanisms, when a ban
              would be unenforceable, or when it is expected that
              implementations of the protocol will not be in a position to
              always use the mechanisms properly.  The latter two cases are
              likely to occur with larger protocol elements such as MIME
              entities, especially when implementations of the protocol will
              obtain such entities from file systems, from protocols that do not
              have encoding identification mechanisms for payloads (such as FTP)
              or from other protocols that do not guarantee proper
              identification of character encoding (such as HTTP).

    Oh. OH! There it is. It even calls out FTP and HTTP as the poster
      children for "SHOULD NOT forbid", and I suspect that we're well
      past 99% of all RFC access being over those two protocols (with
      the vast majority being HTTP).

    We did a lot of hashing over of this issue on the rfc-design team
      (and I even started out where you are now, advocating against a
      BOM). While I concede that there is no perfect solution, the "pro"
      column for using BOMs ended up being far more compelling than the
      "con" column.
    Even if we find the philosophical arguments to be a draw, this
      single practical argument is what convinced me that not including
      a BOM would be a solid request for failure:

    On 11/3/13 12:08 PM, Dave Thaler wrote:

        Reality check...

        I just ran a test with two UTF-8 files, one with a BOM and one
        without. In case you want to try them yourself, they're at

        http://research.microsoft.com/~dthaler/Utf8NoBom.txt

        http://research.microsoft.com/~dthaler/Utf8WithBom.txt

        It includes Latin, Greek, and Cyrillic.

        I tried opening them with a bunch of utilities, and browsers
        (opening local files not using HTTP), and used browsershots.org
        to get screenshots of HTTP access across many browsers and
        platforms. Note the HTTP server provides no content encoding
        headers so it's up to the app to detect. I just copied the files
        to a generic web server, and we may expect others would do the
        same with their own I-Ds and RFC mirrors.
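
        For anyone who wants to reproduce a test pair like this locally,
        here is a minimal Python sketch. The sample string and the
        function name are assumptions made for illustration; they are
        not the contents of the files linked above. It relies only on
        the standard "utf-8" and "utf-8-sig" codecs, the latter of which
        prepends the three-byte signature EF BB BF.

```python
import codecs

# Hypothetical sample text mixing Latin, Greek, and Cyrillic letters.
SAMPLE = "Latin: café, Greek: αβγ, Cyrillic: абв\n"


def encode_variants(text):
    """Return (without_bom, with_bom) UTF-8 encodings of text."""
    without_bom = text.encode("utf-8")
    # "utf-8-sig" writes the UTF-8 signature (U+FEFF) before the payload.
    with_bom = text.encode("utf-8-sig")
    return without_bom, with_bom


no_bom, with_bom = encode_variants(SAMPLE)

# Show the first three bytes of each variant in hex for comparison.
print(no_bom[:3].hex(), with_bom[:3].hex())
```

        Writing the two byte strings to disk with open(path, "wb") would
        give a pair of files that differ only in the leading signature.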

        Results:

        1) Some apps worked fine with both files.  These include things
        like Notepad, Outlook, Word, File Explorer, Visual Studio 2012.

        2) Some apps failed with both files (probably written to be
        ASCII only). These include things like Windiff, stevie (a vi
        clone), textpad, the Links browser (on Ubuntu), and the
        Konqueror browser (on Ubuntu).

        3) Everything else, including almost all browsers, only
        displayed the file correctly with the BOM

        This included:

          Windows apps: Wordpad
          Windows using local files (no HTTP): IE, Firefox, Chrome
          Windows using HTTP: IE, Firefox, Chrome, Navigator
          Mac OSX: Safari, Camino
          Debian: Opera, Dillo
          Ubuntu: Luakit, Iceape

        Conclusion: If we want people to use UTF-8 RFCs and I-Ds with
        existing tools and browsers today, any UTF-8 text format needs
        to include a BOM.
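
    The failure mode in category 3 can be modeled with a short Python
    sketch. This is only an assumption about what such tools do, not
    the actual logic of any application named above: with no charset
    label available, the hypothetical reader trusts a UTF-8 signature
    if one is present and otherwise falls back to a legacy single-byte
    encoding (latin-1 is chosen here purely for illustration).

```python
import codecs


def decode_like_bom_only_tool(data, legacy="latin-1"):
    """Decode bytes the way a hypothetical BOM-sniffing reader might.

    Assumption: the reader has no external charset metadata. It treats
    a leading EF BB BF as proof of UTF-8 and otherwise uses `legacy`.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Strip the signature, then decode the remainder as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: fall back to the legacy default, which turns
    # multi-byte UTF-8 sequences into mojibake.
    return data.decode(legacy, errors="replace")


text = "αβγ"
payload = text.encode("utf-8")

print(decode_like_bom_only_tool(codecs.BOM_UTF8 + payload))
print(decode_like_bom_only_tool(payload))
```

    Under this model the signed payload round-trips, while the unsigned
    one is rendered as unrelated Latin-1 characters. Pure ASCII input
    is unaffected either way, since ASCII bytes decode identically in
    both encodings.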

    /a