Re: Should the IETF be condoning, even promoting, BOM pollution?

Matthew Kerwin <matthew@xxxxxxxxxxxxx> · Tue, 26 Sep 2017 22:24:25 +1000

And so we circle around again...

On 26 September 2017 at 21:06, Carsten Bormann <cabo@xxxxxxx> wrote:
> On Sep 26, 2017, at 12:55, Julian Reschke <julian.reschke@xxxxxx> wrote:
> >
> > Please cite *specifically* what you think is relevant with respect to the use of BOMs in plain text files.
>
> That’s all already been said in the thread, but to repeat, with links:
>
> STD0063 section 6:
>
> https://tools.ietf.org/html/rfc3629#section-6
>
> "Use of a BOM is neither required nor recommended for UTF-8":
>
> http://www.unicode.org/versions/Unicode10.0.0/ch02.pdf
>
> And RFC 5198, section 2, item 5:
>
> https://tools.ietf.org/html/rfc5198#section-2
>
> Of course, BOM-pollution apologists will find enough rope in these documents to hang themselves.
>
> That is really the problem here: the tendency to weasel around decisions in standards.
>
> (Or to make them in the first place.  UCS-2-BE vs. UCS-2-LE all over again.)
>
> Grüße, Carsten

RFC 3629 essentially boils down to: A protocol SHOULD forbid the BOM if UTF-8 is mandated, or otherwise signalled.  That and all the other guidance seems to say: guessing is bad, don't guess (even with heuristics) if there's any other way.  For HTTP there is another way (the Content-Type header with a charset parameter).  What's the other way for a .txt file on my FAT32 thumb stick?  The standards don't seem to say much about that, because there was no solution when they were being written, just as there is none now.  (Note that RFC 5198 doesn't apply to files on my thumb stick since it's not internet-enabled, and similarly RFC 3629 talks about "protocols" but I don't think fopen(); read() counts as a protocol.)

You're arguing that the standards say "BOM is bad", while I read them as saying "guessing is bad."  Building from that position, I see the .txt extension + BOM magic number combination as a better signal (i.e. less guessy) than the extension alone that a file is likely a UTF-8 encoded plaintext file.  I'm well aware that makes me a "BOM-pollution apologist", but I haven't been offered anything better so far.
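
To make the "less guessy" comparison concrete, here is a minimal Python sketch of the two signals side by side. It is only an illustration under my own assumptions: the function name bom_signal and the three labels it returns are made up for this message and don't come from any standard or tool.

```python
import codecs


def bom_signal(filename: str, head: bytes) -> str:
    """Classify a file from its name and its first few bytes.

    Hypothetical helper: the labels are illustrative, not standardized.
    """
    is_txt = filename.lower().endswith(".txt")
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    has_bom = head.startswith(codecs.BOM_UTF8)
    if is_txt and has_bom:
        # Extension plus magic number: the stronger signal.
        return "likely UTF-8 text"
    if is_txt:
        # Extension alone: the encoding still has to be guessed.
        return "text, encoding unknown"
    return "unknown"
```

A reader that gets "likely UTF-8 text" could then decode with Python's "utf-8-sig" codec, which strips the BOM; in the other cases it is back to heuristics.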

Somewhere in here there seems to be an argument for having richer metadata capabilities in file systems, and/or more intelligent downloading tools that translate between the web's metadata and the file system's (whether that's rich metadata, or file name patterns, or magic numbers, or whatever).  And if so, I agree enthusiastically.  Unfortunately I'm not in a position to do much about that.  The penultimate argument seems to be: serve the RFCs over HTTP as text/plain;charset=utf-8 *without* the BOM (as per standards), and make the human deal with it if their downloading/viewing tools won't do the Right Thing™ (as per tradition in IT).  Given that RFCs are usually read by technically savvy people, I guess that's live-withable.
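
For what it's worth, the "more intelligent downloading tool" idea could look something like the Python sketch below: the server sends no BOM, and the tool turns the charset it was told about into a magic number only at the moment the bytes hit the disk. This is an assumption-laden illustration, not a description of any existing tool; body_for_disk is a hypothetical name, and limiting it to text/plain is my own choice.

```python
import codecs
from email.message import Message


def body_for_disk(content_type: str, body: bytes) -> bytes:
    """Return the bytes a downloader might write to a metadata-poor file system.

    Hypothetical helper: if the HTTP Content-Type declared text/plain with
    charset=utf-8 and the body has no BOM, prepend one so the charset
    information survives as a magic number. Anything else passes through.
    """
    # Reuse the stdlib header parser instead of splitting on ';' by hand.
    msg = Message()
    msg["Content-Type"] = content_type
    is_plain = msg.get_content_type() == "text/plain"
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if is_plain and charset == "utf-8" and not body.startswith(codecs.BOM_UTF8):
        return codecs.BOM_UTF8 + body
    return body
```

On a file system with real metadata the tool would record the charset there and leave the bytes alone; the BOM is only the fallback for things like my thumb stick.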

I'm not going to go flipping any tables whichever way this discussion ends up; I prefer the HTML versions anyway (even if they're non-canonical).

Cheers
-- 
  Matthew Kerwin
  http://matthew.kerwin.net.au/