Re: Should the IETF be condoning, even promoting, BOM pollution?

"tom p." <daedulus@xxxxxxxxxxxxx> · Tue, 26 Sep 2017 18:10:58 +0100

----- Original Message -----
From: "Matthew Kerwin" <matthew@xxxxxxxxxxxxx>
To: "Carsten Bormann" <cabo@xxxxxxx>
Cc: "Julian Reschke" <julian.reschke@xxxxxx>; "IETF" <ietf@xxxxxxxx>
Sent: Tuesday, September 26, 2017 1:24 PM
And so we circle around again...

On 26 September 2017 at 21:06, Carsten Bormann <cabo@xxxxxxx> wrote:

> On Sep 26, 2017, at 12:55, Julian Reschke <julian.reschke@xxxxxx> wrote:
> >
> > Please cite *specifically* what you think is relevant with respect to
> > the use of BOMs in plain text files.
>
> That’s all already been said in the thread, but to repeat, with links:
>
> STD0063 section 6:
> https://tools.ietf.org/html/rfc3629#section-6
> "Use of a BOM is neither required nor recommended for UTF-8":
> http://www.unicode.org/versions/Unicode10.0.0/ch02.pdf
>
> And RFC 5198, section 2, item 5:
> https://tools.ietf.org/html/rfc5198#section-2
>
> Of course, BOM-pollution apologists will find enough rope in these
> documents to hang themselves.
> That is really the problem here: the tendency to weasel around decisions
> in standards.
> (Or to make them in the first place.  UCS-2-BE vs. UCS-2-LE all over
> again.)
>
> Grüße, Carsten
>
>

RFC 3629 essentially boils down to: A protocol SHOULD forbid the BOM if
UTF-8 is mandated, or otherwise signalled.  That and all the other guidance
seems to say: guessing is bad, don't guess (even with heuristics) if
there's any other way.  For HTTP there is another way (content-type with
charset).  What's the other way for a .txt file on my FAT32 thumb stick?
The standards don't seem to say much about that, because there was no
solution when they were being written, just as there is none now.  (Note
that RFC 5198 doesn't apply to files on my thumb stick since it's not
internet-enabled, and similarly 3629 talks about "protocols" but I don't
think fopen();read() counts as a protocol.)
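
To make the HTTP case concrete, here is a minimal Python sketch of reading the charset from a Content-Type header value, so nothing has to be guessed from the bytes. The function name is hypothetical and the header strings are just examples; it only uses the standard library's email.message parser.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the declared charset (lowercased), or None if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() reads the "charset" parameter of Content-Type.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/plain;charset=utf-8"))
    print(charset_from_content_type("text/plain"))
```

A bare file on a FAT32 stick has no equivalent of that header, which is the gap described above.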

You're arguing that the standards say "BOM is bad", while I read them as
saying "guessing is bad."  Building from that position, I see the .txt
extension + BOM magic number combination as a better signal (i.e. less
guessy) than the extension alone that a file is likely a UTF-8 encoded
plaintext file.  I'm well aware that makes me a "BOM-pollution apologist",
but I haven't been offered anything better so far.
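
As an illustration of "BOM as a signal" versus "guessing", a rough Python sketch follows. The function name and the four labels are made up for this example. It treats a leading EF BB BF as an explicit marker, and labels everything else as either pure ASCII, a guess (the bytes merely happen to be valid UTF-8), or unknown.

```python
import codecs


def sniff_text_encoding(data: bytes) -> str:
    """Classify raw bytes as 'utf-8-bom', 'ascii', 'utf-8-guess' or 'unknown'."""
    # Explicit signal: the file starts with the UTF-8 BOM (EF BB BF).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    # Pure 7-bit data decodes identically as ASCII and as UTF-8.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Heuristic: valid UTF-8 without a BOM is still only a guess.
    try:
        data.decode("utf-8")
        return "utf-8-guess"
    except UnicodeDecodeError:
        return "unknown"
```

Only the first branch relies on something the file itself declares; the last two are the kind of heuristic the guidance warns against.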

Somewhere in here there seems to be an argument for having richer metadata
capabilities in file systems, and/or more intelligent downloading tools
that translate between the web's metadata and the file system's (whether
that's rich metadata, or file name patterns, or magic numbers, or
whatever).  And if so, I agree enthusiastically.  Unfortunately I'm not in
a position to do much about that.  The penultimate argument seems to be: serve
the RFCs over HTTP as text/plain;charset=utf-8 **without** the BOM (as per
standards), and make the human deal with it if their downloading/viewing
tools won't do the Right Thing™ (as per tradition in IT).  Given that RFCs
are usually read by technically savvy people, I guess that's live-withable.
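
For what it's worth, both halves of that are cheap to sketch in Python. The helper names below are hypothetical: one strips a leading BOM before the bytes go out over HTTP, the other is what a tolerant reader could do on the receiving side, using the standard "utf-8-sig" codec, which accepts UTF-8 with or without a BOM.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_text_tolerantly(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if there is one."""
    return data.decode("utf-8-sig")
```

Neither helper tells you that a file *is* UTF-8; they only assume it, which is why the charset still has to come from somewhere else.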

I'm not going to go flipping any tables whichever way this discussion ends
up; I prefer the HTML versions anyway (even if they're non-canonical).

<tp>

With syslog, RFC5424, BOM is mandated when the message is encoded in
UTF8, but that is a protocol and not a filestore.
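
For illustration only, the MSG part of such a message could be built as below. This is a sketch with a hypothetical function name; it covers just the MSG field (BOM followed by the UTF-8 text) and not the syslog header or structured data.

```python
import codecs


def syslog_msg_utf8(text: str) -> bytes:
    """Encode MSG as UTF-8 and prefix it with the UTF-8 BOM (EF BB BF)."""
    return codecs.BOM_UTF8 + text.encode("utf-8")
```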

We do have metadata in a filestore.  As has already been pointed out,
the suffix .txt indicates ASCII and not UTF8 so applications that rely
on that for genuine plain text will likely fail.  As Brian pointed out,
so what we need now is a new suffix for UTF.

Tom Petch

Cheers
--
  Matthew Kerwin
  http://matthew.kerwin.net.au/