Re: Should the IETF be condoning, even promoting, BOM pollution?

Julian Reschke <julian.reschke@xxxxxx> · Tue, 26 Sep 2017 13:26:19 +0200

On 2017-09-26 13:06, Carsten Bormann wrote:
On Sep 26, 2017, at 12:55, Julian Reschke <julian.reschke@xxxxxx> wrote:

Please cite *specifically* what you think is relevant with respect to the use of BOMs in plain text files.

That’s all already been said in the thread, but to repeat, with links:

STD0063 section 6:
https://tools.ietf.org/html/rfc3629#section-6

Like:

   o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol does not
      provide character encoding identification mechanisms, when a ban
      would be unenforceable, or when it is expected that
      implementations of the protocol will not be in a position to
      always use the mechanisms properly.  The latter two cases are
      likely to occur with larger protocol elements such as MIME
      entities, especially when implementations of the protocol will
      obtain such entities from file systems, from protocols that do not
      have encoding identification mechanisms for payloads (such as FTP)
      or from other protocols that do not guarantee proper
      identification of character encoding (such as HTTP).

...which is *exactly* what we're discussing here?

"Use of a BOM is neither required nor recommended for UTF-8":
http://www.unicode.org/versions/Unicode10.0.0/ch02.pdf

That talks about whether a BOM is or is not useful to distinguish 
between Unicode encoding schemes. But that's not really relevant here, 
unless all plain text files were indeed already in one of the Unicode 
encoding schemes. They are not, and that's the problem.

And RFC 5198, section 2, item 5:
https://tools.ietf.org/html/rfc5198#section-2
...

That has the same problem - it assumes a world that is already fully 
Unicode, in which case it's correct to say that the BOM is not needed.

However, plain text files are something that predates all of this, and 
the tools that the consumers of plain text RFCs use deal with this mixed 
encoding world in several ways.
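
To illustrate one such way, here is a minimal sketch of a tool that treats a leading U+FEFF as a UTF-8 signature and otherwise falls back to a legacy encoding. The function name, the windows-1252 default, and the fallback rule are assumptions for illustration only; they do not describe any particular tool or the tests mentioned in this thread.

```python
import codecs


def decode_like_legacy_tool(data: bytes, legacy_encoding: str = "cp1252") -> str:
    """Hypothetical decoder: trust a UTF-8 signature, else assume a legacy charset."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF, i.e. U+FEFF encoded as UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        # Signature present: strip it and decode the remainder as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: this sketch assumes the platform's legacy encoding,
    # which garbles non-ASCII characters if the file was really UTF-8.
    return data.decode(legacy_encoding)


utf8_bytes = "é".encode("utf-8")
with_bom = decode_like_legacy_tool(codecs.BOM_UTF8 + utf8_bytes)
without_bom = decode_like_legacy_tool(utf8_bytes)
```

Under these assumptions the same UTF-8 bytes decode correctly with the signature and turn into mojibake without it, while pure ASCII content reads the same either way.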

I agree that if the goal was to promote an all-Unicode world, the answer 
would be different. But the goal of the RFC Editor is to deliver 
documents that people will be able to read properly with the tools they 
have. The tests we did showed that adding the BOM is beneficial for this.

(That said: this is an in-between period - once the transition to the 
format is finished, the preferred consumption format will be HTML anyway.)

Of course, BOM-pollution apologists will find enough rope in these documents to hang themselves.
That is really the problem here: the tendency to weasel around decisions in standards.
(Or to make them in the first place.  UCS-2-BE vs. UCS-2-LE all over again.)

My point being: none of the things you apparently refer to applies to 
what we are discussing here.

Best regards, Julian

BTW: if you believe that the text *I* quoted from RFC 3629 is bad, you 
might want to submit an erratum and/or start a discussion on updating 
the document.