Re: So do both [was Re: Should the IETF be condoning, even promoting, BOM pollution?]

John C Klensin <john-ietf@xxxxxxx> · Mon, 09 Oct 2017 22:29:34 -0400

--On Monday, October 9, 2017 16:36 -0700 "Heather Flanagan (RFC
Series Editor)" <rse@xxxxxxxxxxxxxx> wrote:

> On 10/9/17 10:14 AM, John C Klensin wrote:
>> --On Wednesday, September 27, 2017 08:38 +1300 Brian E
>> Carpenter <brian.e.carpenter@xxxxxxxxx> wrote:
>> 
>>> So why don't we, the Internet standards people who believe in
>>> rough consensus and running code, request the RFC Editor (a
>>> friend of ours) to supply two text versions of each RFC, like
>>> 
>>> https://www.rfc-editor.org/rfc/rfc8187.txt   as today, with
>>> BOM if relevant 
>>> https://www.rfc-editor.org/rfc/rfc8187.ut8
>>> containing pure UTF-8 with no BOM ever
>> If one were really going to do that, one would need three
>> representations (pick your own three-character suffixes for
>> the first two):
>> 
>> 	rfc8176.utf8   (standard/normal Unicode in UTF-8, no BOM)
>> 	rfc8176.utf8-with-BOM (as above, but...)
>> 	rfc8176.txt    (ASCII, with characters outside the ASCII
>> repertoire expressed as \u'[N[N]]NNNN' (see RFC 5137) or
>> another escaping system of the RFC Editor's choice).
> 
> 
> A few points to consider. First, the RFC Editor will review,
> at least to some extent, every file we produce, and our tools
> will need to be modified to create the additional formats;
> that complexity would then need to be maintained going
> forward. The more files added, the more resources it will take
> to produce. This has implications for either the time it takes
> to publish or the cost it takes to publish. Second, there have
> also been some discussions about creating separate files for
> paginated versus unpaginated text files. That would take us up
> to six files just for the plain-text outputs (noting the RFC
> Editor also has the PDF/A-3 and HTML to review).
> 
> Alternatively, the IETF community that prefers plain text can
> develop tools that take the one file created by the RFC
> Editor and strip the BOM, add pagination, or run it through a
> translation tool to get it in their native language--these
> will not be produced or reviewed by the RFC Editor, but will
> perhaps meet the individual desires here. Given the number of
> options, opinions, and resources involved, I think this makes
> the most sense.

Up to a point, yes.  On the other hand, unless the RFC Editor
intends to make a rule requiring either that sections (or
subsections) not extend over circa a page, or numbering lines,
or doing something else that facilitates references into a
document, I think you'd best retain a canonical / distributed
version with page numbers, headers, and footers.  That
information is a lot easier to remove than it is to reliably add.
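
As a rough illustration of why removal is the easy direction, here is a
minimal Python sketch. It assumes a page-break layout of a footer line
ending in "[Page N]", then a form feed, then a single header line; that
layout and the function name strip_pagination are assumptions made for
this example, not a description of any RFC Editor tool.

```python
import re

# Assumed page-break layout (hypothetical, for illustration only):
#   <footer line ending in "[Page N]">
#   <form feed>
#   <one header line>
PAGE_BREAK = re.compile(r"\n[^\n]*\[Page \d+\][ \t]*\n\f\n?[^\n]*\n")


def strip_pagination(text: str) -> str:
    """Drop footer, form feed, and header at each page break.

    The footer on the final page is not followed by a form feed,
    so this sketch leaves it in place.
    """
    return PAGE_BREAK.sub("\n", text)


if __name__ == "__main__":
    sample = (
        "line one\n"
        "\n"
        "Author                   Informational                  [Page 1]\n"
        "\f\n"
        "RFC 9999                   Some Title               October 2017\n"
        "\n"
        "line two\n"
    )
    print(strip_pagination(sample))
```

Going the other way (deciding where page breaks belong, avoiding widows
and orphans, and regenerating headers and footers) has no equally short
counterpart, which is the asymmetry at issue.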

Sorry I wasn't more clear that the suggestion above was a bit
tongue in cheek for more or less the reasons you identify.   But
I am concerned about the use of "txt" to identify files that are
not entirely ASCII.  The reasons have to do with guessing at
encodings (a far more complex question than whether or not there
are octets present that might be a BOM) and have, IMO, been
discussed (at great length) elsewhere.
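
To make the distinction concrete, here is a small Python sketch that
separates three questions a ".txt" suffix leaves open: whether the file
starts with the UTF-8 BOM octets, whether the rest is entirely ASCII,
and whether it decodes as UTF-8 at all. The function name
describe_octets and the returned keys are hypothetical, chosen for this
example. Note that none of these checks identifies the encoding of
arbitrary text; a failed UTF-8 decode says only what the file is not.

```python
import codecs


def describe_octets(data: bytes) -> dict:
    """Report BOM presence, ASCII-only status, and UTF-8 validity."""
    # codecs.BOM_UTF8 is the three octets EF BB BF.
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data

    # ASCII means every remaining octet is below 0x80.
    is_ascii = all(octet < 0x80 for octet in body)

    # Valid UTF-8 is a weaker claim than ASCII and a separate
    # question from whether a BOM is present.
    try:
        body.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    return {"has_bom": has_bom, "is_ascii": is_ascii, "valid_utf8": valid_utf8}


if __name__ == "__main__":
    print(describe_octets(b"\xef\xbb\xbfplain text"))
    print(describe_octets("na\u00efve".encode("utf-8")))
    print(describe_octets(b"na\xefve"))  # ISO 8859-1 octets, not UTF-8
```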

    john