Re: Should the IETF be condoning, even promoting, BOM pollution?


 



On 2017-09-18 08:43, Carsten Bormann wrote:

The reason for the BOM was so that existing tools would load the file correctly in the absence of character encoding information.

(AFAIR, the ability to make tools like Notepad "do the right thing" was an important step toward actually reaching the decision to allow non-ASCII characters.)

And yes, this is only relevant for plain text (as opposed to HTML), served from the file system.
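
To make that concrete, here is a minimal sketch (in Python; the function name, the file name, and the cp1252 fallback are assumptions for illustration, not any particular editor's actual logic) of the kind of BOM sniff a tool performs when no character encoding information is available:

    import codecs

    def sniff_encoding(raw: bytes) -> str:
        # Guess a text file's encoding from its leading bytes (BOM sniff).
        # Purely illustrative; real tools differ in detail and in the
        # legacy fallback they pick when no BOM is present.
        if raw.startswith(codecs.BOM_UTF8):          # EF BB BF
            return "utf-8-sig"                       # UTF-8 "signature"; this codec strips the BOM
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            return "utf-16"                          # the codec consumes the BOM itself
        return "cp1252"                              # assumed legacy fallback

    with open("example.txt", "rb") as f:             # placeholder file name
        data = f.read()
    text = data.decode(sniff_encoding(data))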

Employing the Byte Order Mark (BOM), which is needed in UTF-16 but not in UTF-8, as a file “signature” (magic number) to identify plain text files that use UTF-8 beyond ASCII, is well known to have caused many of the problems in migrating to UTF-8.

The problems come both from tools that would otherwise have no trouble upgrading from ASCII to UTF-8 but now malfunction because of those BOMs, and from tools that suddenly *expect* every UTF-8 file beyond ASCII to carry that signature and no longer work when it is absent.  The first set of problems is compounded by tools that silently insert BOMs at various stages of processing UTF-8 files (BOM pollution), and by other tools that make any BOM present in a plain text file invisible to casual examination, so problems caused by BOM pollution are hard to recognize.
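
To illustrate the first failure mode, here is a small sketch (Python; the file contents and the header check are hypothetical) of how a silently inserted, visually invisible BOM breaks a tool that otherwise handles UTF-8 without trouble, and how decoding with the utf-8-sig codec strips it again:

    # A BOM silently inserted upstream ("BOM pollution"); contents are hypothetical.
    polluted = b"\xef\xbb\xbf[settings]\nvalue = 1\n"

    text = polluted.decode("utf-8")
    print(repr(text.splitlines()[0]))       # '\ufeff[settings]' -- the BOM survives decoding
    print(text.startswith("[settings]"))    # False: an exact-match check now fails

    # The utf-8-sig codec strips a leading BOM if present and is harmless otherwise.
    clean = polluted.decode("utf-8-sig")
    print(clean.startswith("[settings]"))   # True

The converse failure (a tool that rejects UTF-8 input *without* the signature) is the mirror image of the same check.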
...

It would be helpful if you could provide examples of these kinds of tools. I'd like to understand whether this has happened in practice (and when), or whether it's a purely theoretical argument.

Best regards, Julian



