On 10/10/2017 06:14, John C Klensin wrote:
> Sorry for the delay -- catching up on this thread after
> temporarily giving up on it.
>
> --On Wednesday, September 27, 2017 08:38 +1300 Brian E Carpenter
> <brian.e.carpenter@xxxxxxxxx> wrote:
>
>> So why don't we, the Internet standards people who believe in
>> rough consensus and running code, request the RFC Editor (a
>> friend of ours) to supply two text versions of each RFC, like
>>
>> https://www.rfc-editor.org/rfc/rfc8187.txt as today, with
>> BOM if relevant
>> https://www.rfc-editor.org/rfc/rfc8187.ut8
>> containing pure UTF-8 with no BOM ever
>
> If one were really going to do that, one would need three
> representations (pick your own three-character suffixes for the
> first two):
>
> rfc8176.utf8 (standard/normal Unicode in UTF-8, no BOM)
> rfc8176.utf8-with-BOM (as above, but...)
> rfc8176.txt (ASCII, with characters outside the ASCII
> repertoire expressed as \u'[N[N]]NNNN' (see RFC 5137) or
> another escaping system of the RFC Editor's choice.

I see your logic, but pragmatically today the implication of
".txt" is not "pure ASCII", but rather "possibly ASCII, or
possibly UTF-8 if there's a BOM, or possibly some other random
choice of coded character set." There's very little to be done
about the third case, so I'm arguing for the second one as the
lowest common denominator of sorts.

"pure ASCII with some convention for Unicode escapes" might
be a good idea too, but I don't quite see how to get there
from here.

Regards,
    Brian

> Note that there is no good reason to assume that a text
> file that contains octets outside the ASCII range is
> UTF-8, especially if the creation date is unknown.
> Historically, it could as easily be encoded as specified
> in one of ISO/IEC 8859-X standards, some proprietary
> code page, etc.)
>
> Because "\u'2639'" requires significantly more horizontal space
> than "☹", the txt form with escapes would require some
> reformatting, but the native XML idea will solve all those
> problems, right?
>
> john
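
As a rough illustration of the cases discussed in this message, the Python
sketch below labels the octets of a ".txt" file as BOM-prefixed UTF-8, pure
ASCII, BOM-less octets that happen to decode as UTF-8, or unknown, and
rewrites non-ASCII characters in the \u'[N[N]]NNNN' form from RFC 5137. The
function names classify_text and escape_rfc5137 and the label strings are
hypothetical, chosen only for this example; nothing here is RFC Editor
tooling. Note that a successful UTF-8 decode is only a heuristic, since
octets in an ISO/IEC 8859-X encoding or a proprietary code page could, in
principle, also form valid UTF-8.

```python
import codecs


def classify_text(data: bytes) -> str:
    """Best-effort label for the octets of a ".txt" file (illustrative only)."""
    # Case 1: an explicit UTF-8 BOM (EF BB BF) at the start of the file.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    # Case 2: every octet is in the ASCII range.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Case 3: no BOM, but the octets decode as UTF-8. This is a guess,
    # not proof: the file could still be in some other encoding.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Case 4: some other coded character set; nothing more can be said.
        return "unknown"


def escape_rfc5137(text: str) -> str:
    """Replace each non-ASCII character with a \\u'XXXX' escape (4 to 6 hex digits)."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            parts.append(ch)
        else:
            parts.append("\\u'%04X'" % code_point)
    return "".join(parts)
```

For example, escape_rfc5137("☹") yields the eight-character string
\u'2639' in place of a single character, which is the horizontal-space
cost mentioned in the quoted text above.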