Re: So do both [was Re: Should the IETF be condoning, even promoting, BOM pollution?]

Brian E Carpenter <brian.e.carpenter@xxxxxxxxx> · Tue, 10 Oct 2017 10:25:23 +1300

On 10/10/2017 06:14, John C Klensin wrote:
> Sorry for the delay -- catching up on this thread after
> temporarily giving up on it.
> 
> --On Wednesday, September 27, 2017 08:38 +1300 Brian E Carpenter
> <brian.e.carpenter@xxxxxxxxx> wrote:
> 
>> So why don't we, the Internet standards people who believe in
>> rough consensus and running code, request the RFC Editor (a
>> friend of ours) to supply two text versions of each RFC, like
>>
>> https://www.rfc-editor.org/rfc/rfc8187.txt
>>     as today, with BOM if relevant
>> https://www.rfc-editor.org/rfc/rfc8187.ut8
>>     containing pure UTF-8 with no BOM ever
> 
> If one were really going to do that, one would need three
> representations (pick your own three-character suffixes for the
> first two):
> 
> 	rfc8176.utf8   (standard/normal Unicode in UTF-8, no BOM)
> 	rfc8176.utf8-with-BOM (as above, but...)
> 	rfc8176.txt    (ASCII, with characters outside the ASCII
> repertoire expressed as \u'[N[N]]NNNN' (see RFC 5137) or
> another escaping system of the RFC Editor's choice.

I see your logic, but pragmatically today the implication of ".txt"
is not "pure ASCII", but rather "possibly ASCII, or possibly
UTF-8 if there's a BOM, or possibly some other random choice of
coded character set." There's very little to be done about the
third case, so I'm arguing for the second one as the lowest
common denominator of sorts.
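
As a rough illustration of those three cases, here is a minimal Python
sketch. The function name classify_txt and the returned labels are
hypothetical, chosen only for this example; nothing here is anything
the RFC Editor actually does.

```python
import codecs


def classify_txt(data: bytes) -> str:
    """Rough three-way classification of the octets of a ".txt" file.

    Hypothetical helper for illustration only.
    """
    # Check for the BOM first: its octets (EF BB BF) are themselves non-ASCII.
    if data.startswith(codecs.BOM_UTF8):
        try:
            data.decode("utf-8")
            return "utf-8 with BOM"
        except UnicodeDecodeError:
            # A BOM followed by octets that are not valid UTF-8.
            return "unknown"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        # Non-ASCII octets and no BOM: could be UTF-8, ISO/IEC 8859-x,
        # a proprietary code page, and so on. The octets alone cannot say.
        return "unknown"
```

Note that BOM-less UTF-8 deliberately lands in "unknown" here: without
the BOM, a reader can only guess.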

"pure ASCII with some convention for Unicode escapes" might be
a good idea too, but I don't quite see how to get there from
here.
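
For concreteness, the escaping half of that idea might look like the
following Python sketch, which uses the \u'[N[N]]NNNN' form quoted
above. The function name escape_rfc5137 is hypothetical, and the sketch
assumes the input contains no literal backslash sequences that would
need disambiguating.

```python
def escape_rfc5137(text: str) -> str:
    """Replace every non-ASCII character with a \\u'XXXX' escape.

    Hypothetical helper for illustration only; literal backslashes in
    the input are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII passes through untouched
        else:
            # At least four uppercase hex digits; code points above
            # U+FFFF naturally produce five or six.
            out.append("\\u'%04X'" % cp)
    return "".join(out)
```

Each escaped character grows from one column to eight or more, which
is where the reformatting cost comes from.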

Regards,
    Brian

> Note that there is no good reason to assume that a text
> file that contains octets outside the ASCII range is
> UTF-8, especially if the creation date is unknown.
> Historically, it could as easily be encoded as specified
> in one of the ISO/IEC 8859-X standards, some proprietary
> code page, etc.)
> 
> Because "\u'2639'" requires significantly more horizontal space
> than "☹", the txt form with escapes would require some
> reformatting, but the native XML idea will solve all those
> problems, right?
> 
>      john