On Sep 18, 2017, at 08:54, Julian Reschke <julian.reschke@xxxxxx> wrote:
>
>> The problems come both from tools that otherwise would have no problem upgrading from ASCII to UTF-8, that are now malfunctioning because of those BOMs, and from tools that now suddenly *expect* that all UTF-8 files beyond ASCII have that signature and no longer work when they don’t. The first set of problems is compounded by tools that silently insert BOMs at various stages of processing UTF-8 files (BOM pollution), and by other tools that make any BOM present in a plain-text file invisible to casual examination, so problems caused by BOM pollution are hard to recognize.
>> ...
>
> It would be helpful if you have examples for these kinds of tools. I'd like to understand whether this has happened in practice (and when), or whether it's a purely theoretical argument.

Attacking the BOM pollution problem with scientific rigor is a bit outside my domain, but I can give you some anecdotal evidence for the different classes of tools cited above.

A *pollution-intolerant* tool is one that can do UTF-8 but reacts adversely to UTF-8 with BOMs. Many tools that look for text signatures to do some form of file type detection are foiled by BOMs. A recent example I happen to remember is Martin Thomson’s I-D template code; it tries to recognize two different variants of markdown by looking at the first three characters and fails if those include a BOM.

A *pollution-expecting* tool is one that requires all UTF-8 files to have a BOM and doesn’t work properly otherwise. Windows Notepad is rumored to have that problem, but I don’t use Windows, so I can’t verify it.

A *polluting* tool is one that silently adds BOMs. The anecdotal evidence I have here is hearsay: when the IETF secretariat briefly switched to serving plain text as UTF-8, they rolled that back after hearing that people had processing pipelines that combined a polluting tool with a pollution-intolerant one, so the pipeline broke. Apparently, the polluting tool stored any file that was served via HTTP as UTF-8 with a BOM added; I don’t know what the pollution-intolerant tools were.

A *pollution-hiding* tool is one that makes BOM pollution hard to recognize. I think many tools have reacted to increasing BOM pollution by hiding this arcane detail from their users, so there are many examples; pollution-hiding is the result of the usual process of specification soupification.

(As a counter-example, kramdown-rfc warns when it detects BOM pollution in its input, but does process the file. Emacs shows the encoding of a BOM-polluted file as “B” [instead of “U” for UTF-8] and preserves the BOM pollution during editing, continuing to treat BOM pollution as just another encoding alternative to UTF-8, so it is *pollution-preserving*. Clang mentions BOM pollution only when it emits a warning or error about the first line: then it explicitly shows the BOM as <U+FEFF>. And so on; time for a survey paper, I think.)

It is amazing how much programmer time (and, more importantly, user time) must have been wasted by the idea of being “helpful” with the UTF-8 transition through BOM pollution. This appears to be an instance of the general rule that, when people try to be “helpful” in a protocol transition, you’d better run as fast as you can...

Grüße, Carsten
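
P.S.: To make the pollution-intolerant failure mode concrete, here is a minimal sketch (Python, not taken from any of the tools mentioned above) of signature-based format sniffing that strips a UTF-8 BOM instead of tripping over it. The “---”/“%%%” signatures for the two markdown variants are my assumption, based on the usual kramdown-rfc and mmark front-matter conventions.

  # Hypothetical sketch (not the I-D template's or kramdown-rfc's actual
  # code): signature sniffing that warns about and strips a UTF-8 BOM
  # rather than misdetecting the file.
  import sys

  UTF8_BOM = b"\xef\xbb\xbf"

  def sniff(path):
      with open(path, "rb") as f:
          head = f.read(16)
      if head.startswith(UTF8_BOM):
          # A pollution-intolerant tool compares the first bytes directly
          # and fails here; warn and strip instead.
          print("warning: %s starts with a UTF-8 BOM" % path, file=sys.stderr)
          head = head[len(UTF8_BOM):]
      # Assumed signatures: "---" for kramdown-rfc-style YAML front matter,
      # "%%%" for mmark-style TOML front matter.
      if head.startswith(b"---"):
          return "kramdown-rfc markdown"
      if head.startswith(b"%%%"):
          return "mmark markdown"
      return "unknown"

  if __name__ == "__main__":
      for name in sys.argv[1:]:
          print(name, "->", sniff(name))

A polluting tool earlier in the pipeline then does no harm to this check; the real fix, of course, is to stop adding BOMs in the first place.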