Re: Should the IETF be condoning, even promoting, BOM pollution?

John C Klensin <john-ietf@xxxxxxx> · Tue, 19 Sep 2017 10:17:41 -0700

--On Tuesday, September 19, 2017 7:05 PM +0200 Julian Reschke
<julian.reschke@xxxxxx> wrote:

>...
>> (2) I note that Dave's tests applied to Microsoft bundled
>> applications.   If they are the main problem, then Microsoft
>> should be ashamed of themselves for updating those
>> applications to handle non-ASCII codes and then violating the
>> clear rules for UTF-8 (if they allow UTF-8 at all -- if they
>> decided to not do that and only allow, e.g., UTF-16, that
>> would be a different matter).  While I hope bug reports have
>> been filed, the IETF (or RFC Editor) setting out to break
>> those applications is just not what we do.
> 
> Microsoft's support for non-ASCII characters predates Unicode
> (AFAIU). Notepad has been dealing with non-ASCII characters
> for ages.

Understood, my opinions about how well that worked, especially
for non-Latin scripts, notwithstanding.  But, again, using a BOM
as a substitute for charset=UTF-8 is, at least IMO, not the
brightest of ideas, even though I'm also aware that we had a
difficult transition when the web went from a default of Latin-1
to Unicode.

> Not *defaulting* to UTF-8 is not a bug. It may not be what our
> preference is nowadays, but that's it.

See above.  Slightly different discussion.   But I note that it
isn't hard to distinguish between Latin-1 and UTF-8 without
relying on a BOM -- the hard problem there involves distinguishing
among the various species of 8859 and assorted code pages.
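
To illustrate the point about telling UTF-8 from Latin-1 without a
BOM, here is a minimal sketch, assuming a strict decode is an
acceptable test: non-ASCII bytes that form valid UTF-8 sequences are
very unlikely to be Latin-1 text.  The function name and the
"unknown-8bit" label are hypothetical, chosen for illustration; the
sketch deliberately does not try to tell the 8859 variants or code
pages apart, which is the hard part.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristic guess; assumes the input is ASCII, UTF-8, or some 8-bit charset."""
    # Pure 7-bit data is ASCII (and therefore also valid UTF-8 and Latin-1).
    if all(b < 0x80 for b in data):
        return "ascii"
    # Strict decoding rejects byte sequences that are not well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Could be Latin-1, another ISO 8859 part, or a code page:
        # this check cannot say which.
        return "unknown-8bit"
    return "utf-8"
```

Note that data starting with a UTF-8 BOM is reported as "utf-8" here,
since the BOM bytes are themselves a valid UTF-8 sequence.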

>> ...
>> (4) At the same time, if the complaint is about terrible
>> typography, that is a complaint about plain-text files without
>> any formatting controls and markup, not about ASCII.  If
>> someone dislikes plain-text files, they should, IMO, be
>> looking for a way to do something else (e.g., PDF or HTML),
>> not trying to "fix" plain-text files.
>> ...
> 
> Nobody is doing that, as far as I can tell. And yes, we'll
> have official HTML variants with better typography. In the
> meantime, people can look at unofficial ones.

Exactly.

>> ...
>> (6) If any of the new norms and tools result in plain-text
>> files with only ASCII characters in them starting with a BOM
>> because ASCII is just a subset of UTF-8, I'd consider that
>> seriously broken, a violation of the ASCII standard, and a
>> few other things.  I hope tools and test suites would check
>> for that case and complain if it is encountered.
> 
> I'd consider that a feature, far better than adding it on a
> case-by-case basis. And no, it's not a violation of the ASCII
> standard, as that standard wouldn't apply anymore.

But I believe there has been strong community consensus
for preserving the ASCII format and coding for plain-text files
(or at least for one type of plain-text file: those that do not
contain (substantive) non-ASCII characters).  Won't bother me much
(my plain-text tools work well with and without BOM), but I'd expect
that, if the IETF made a decision to dump ASCII entirely, some
people would look for other places to get work done.  I don't
believe that would be in the IETF's, or the Internet's, best
interests.  YMMV.
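
The check suggested in point (6), flagging files that start with a
BOM but otherwise contain only ASCII, could be sketched roughly as
follows.  The function name is hypothetical and this is only an
illustration of the test, not any existing tool's behavior.

```python
import codecs


def is_ascii_with_bom(data: bytes) -> bool:
    """True if data starts with a UTF-8 BOM and the rest is 7-bit ASCII."""
    if not data.startswith(codecs.BOM_UTF8):
        return False
    body = data[len(codecs.BOM_UTF8):]
    # An empty body also counts as ASCII-only in this sketch.
    return all(b < 0x80 for b in body)
```

A test suite could call this on each plain-text output file and
complain when it returns True.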

>> And, yeah, I think some (perhaps many) of us are going to need
>> to have simple BOM adding and removing tools around, just as
>> we have had tools that convert from LF-only to CRLF formats
>> handy and get to use them often (I note, e.g., that the online
>  > ...
> 
> dos2unix does this very well.

Indeed it does.   And that is more or less what I was trying to
say.  I keep dos2unix around; if I needed it, I'd keep some sort
of BOM-remover around too.  And I wouldn't expect to invest a lot
of energy complaining about having to use the latter, just as I
don't invest a lot of energy complaining about having to use
dos2unix.
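
For what it's worth, a BOM adder/remover of the kind I have in mind
is only a few lines.  A minimal sketch, assuming UTF-8 input held in
memory as bytes; the function names are hypothetical:

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Remove one leading UTF-8 BOM if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM unless the data already starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

Both functions are idempotent, so running either twice on the same
file does no harm.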

>> ...

best,
    john