Re: Should the IETF be condoning, even promoting, BOM pollution?


--On Monday, September 18, 2017 18:16 -0500 Adam Roach
<adam@xxxxxxxxxxx> wrote:

> I think we're talking at cross purposes here.
> 
> Today, as we speak, I have a copy of the RFC repository on my
> hard drive. (To be precise, I have it on most of the hard
> drives of the various machines that I use). For my current
> workflow, I *think* all of them got there via rsync, although
> it's possible that some of them are still using an old
> wget-based setup. It's kind of immaterial how they got there,
> because a careful examination of them would show the same
> result between the two methods (and any others I could think
> of, including FTP mirroring and manually downloading via web
> browsers): it's a sequence of bytes, with a ".txt" file
> extension; identical, regardless of which tool downloaded
> them. There is nothing else about the file to indicate its
> encoding.[1]
> 
> Okay. So, now, I open up the local file browser to that file
> on my hard drive, and double-click on an RFC. An application
> is launched. Let's say that application is Wordpad. How does
> it know which character encoding to use for this file?

It doesn't, and the presence or absence of a pair of octets it
might interpret as a BOM just feeds another heuristic.  Keep in
mind that, if the content of that file were in 8859-1, those two
octets could be interpreted as small thorn followed by small y
with diaeresis (the characters Unicode codes as U+00FE and
U+00FF).  Of course, if it were coded in ISO 8859-5 (or 6, 7, 8,
11, etc.), there would be different interpretations.  I note
that several versions of Wordpad will get just about equally
confused if your rsync (or whatever) fetch results in an object
in your local, Windows-ish, file system with LF as an EOL rather
than CRLF.
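
To make that ambiguity concrete, here is a quick Python sketch
(my illustration only; not what Wordpad or any particular editor
actually does) decoding the same two octets under different
assumed charsets:

    # The UTF-16 BOM octets, as they might appear at the start
    # of a file whose encoding we don't actually know.
    bom = b"\xfe\xff"

    print(bom.decode("utf-16-be"))  # one character: U+FEFF, the BOM itself
    print(bom.decode("latin-1"))    # two characters: thorn, y-with-diaeresis
    print(bom.decode("iso8859-5"))  # two Cyrillic letters instead

Same bytes, three defensible readings; nothing in the file
settles the question.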

Conventions about file names suffixed in ".txt" have worked as
well as they have only because it has been possible to assume
that ".txt" implies ASCII.  From the early days of the net, even
that has not been perfect, not just because EOL=LF has existed
since early on (IIRC, the first version of ASCII required it),
but because there was that EBCDIC problem.  The only real
solution to these problems is files that carry their own
descriptions (the idea of a two-part, or even three-part, file
where one part is a description predates its adoption by Apple
by many years).  Otherwise, it is all heuristics, and the other
strong argument against the BOM as a "this is UTF-8" clue is
that it often won't work.
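
To spell out "often won't work": here is a naive BOM-based
detector (a sketch of the heuristic itself, not anyone's real
code) and two files that defeat it, one in each direction:

    def sniff(raw: bytes) -> str:
        # The heuristic: treat a leading EF BB BF as proof of UTF-8.
        if raw.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        return "unknown"

    # Valid UTF-8 without a BOM -- the common case, since the
    # BOM is optional and widely discouraged in UTF-8.
    print(sniff("café\n".encode("utf-8")))        # -> "unknown"

    # An 8859-1 file that happens to begin with the characters
    # 'ï»¿' encodes to exactly the octets EF BB BF.
    print(sniff("ï»¿hello\n".encode("latin-1")))  # -> "utf-8", wrongly

It misses the ordinary case and can be fooled by the exotic one.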

Back in the early 1970s, I got over believing that it was
reasonable to try to abstract file description information down
to a few characters and embed it in the file name, but I
obviously lost that battle long ago.   Perhaps, if we need to
indicate what is UTF-8 and what isn't, we should start suffixing
files with ".utf8" or, if people like three-character suffixes,
".ut8" or ".uf8", rather than relying on in-file indicators that
violate the relevant standards, don't adequately identify the
relevant CCS, and invite assorted file-concatenation problems.
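
The concatenation problem, at least, is easy to demonstrate
(again just a Python sketch, standing in for a plain cat of two
"UTF-8 with BOM" files as some Windows tools write them):

    part1 = b"\xef\xbb\xbffirst line\n"
    part2 = b"\xef\xbb\xbfsecond line\n"

    # Byte-level concatenation, which is all cat(1) does.
    combined = part1 + part2

    # The second BOM survives in mid-stream, where it is no
    # longer a signature but a stray, invisible U+FEFF in the text.
    text = combined.decode("utf-8")
    print("\ufeff" in text[1:])   # True: an embedded zero-width character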

    john



