Re: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

John C Klensin <john-ietf@xxxxxxx> · Thu, 10 Jan 2008 12:19:12 -0500

--On Thursday, 10 January, 2008 15:21 +0100 Frank Ellermann
<nobody@xxxxxxxxxxxxxxxxx> wrote:

>...
> Hopefully somebody can confirm that IND is correct, or not.
> For HT and FF I hope the final version will somehow express
> that both are not really bad, and as far as they're bad FF is
> worse than HT. 

I'm open to consensus about changes for either HT or FF, but the
theory of "bad" that was used to construct the spec was:

(i) If a "spacing" control has the effect of setting the
position of the next character, it is "bad" unless that position
is unambiguous.   In addition, things are "bad" unless they are
necessary in running text (as distinct from faking things that
are better handled in markup, followed by either device-specific
output or standard page representations, neither of which is
normal text).

It is unambiguous for SP.  It is unambiguous for CRLF.
Independent of the "what is a line-end" problem, it is somewhat
ambiguous for CR or LF alone and for IND.  It is ambiguous for
HT.  It would be ambiguous for FF except that FF is assigned
fairly clear semantics in NVT -- "FF" is not a line ending (CRLF
FF is needed) and, as Bob Braden noted, there is a fairly clear
rule that FF is to be interpreted as "top of next page" if one
knows what a page is and as "blank line" otherwise.  But that
rule is sufficiently often ignored to call for considerable
caution about FF, and the text now contains a cautionary note
for that reason.
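
To make the classification above concrete, here is a minimal Python
sketch.  It is an illustration only, not part of the draft: the names
`classify_spacing` and `SINGLE` and the verdict strings are
hypothetical, and the verdicts simply restate the paragraph above.
IND is assumed to be the C1 control at U+0084.

```python
# Verdicts for single spacing controls, restating the text above.
# All names here are hypothetical and exist only for illustration.
SINGLE = {
    " ": ("SP", "unambiguous"),
    "\r": ("CR", "somewhat ambiguous"),    # bare CR, not part of CRLF
    "\n": ("LF", "somewhat ambiguous"),    # bare LF, not part of CRLF
    "\x84": ("IND", "somewhat ambiguous"), # assumed: IND is U+0084
    "\t": ("HT", "ambiguous"),
    "\x0c": ("FF", "caution"),             # clear NVT rule, often ignored
}


def classify_spacing(text):
    """Return (index, name, verdict) for each spacing control in text."""
    findings = []
    i = 0
    while i < len(text):
        # CRLF is treated as one unit so it is not reported as CR + LF.
        if text[i] == "\r" and text[i + 1:i + 2] == "\n":
            findings.append((i, "CRLF", "unambiguous"))
            i += 2
            continue
        entry = SINGLE.get(text[i])
        if entry is not None:
            findings.append((i, entry[0], entry[1]))
        i += 1
    return findings


if __name__ == "__main__":
    for index, name, verdict in classify_spacing("a b\r\n\tc\n\x0c"):
        print(index, name, verdict)
```

The point of checking for CRLF first is that the same two octets get
a different verdict as a pair than they would alone.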

There is an interesting demonstration of the law of unintended
consequences here.  If we could tell that a string was
unambiguously UTF-8 (or whatever) by looking at it, even if it
contains nothing but ASCII characters, then there would be no
reason to try to make net-utf8 a proper superset of NVT.  If we
could do that, we could also do away with the entire "next line"
debate by prohibiting even CRLF and requiring the use of LS
(U+2028).  In retrospect, there might have been considerable
advantages to forcing the ASCII/UTF-8 distinction by requiring
that UTF-8 strings all start with a BOM, but it is far too late
for that (and probably not, on balance, a good idea despite its
advantages).  So I don't see how to get there from here -- we
are stuck, for historical reasons, with CRLF on the wire as what
The Unicode Standard calls NLF.  (Incidentally, Unicode 5.0,
Section 5.8, provides significant insight into the complexity of
this problem and probably should have been referenced.  It would
be even more helpful had Table 5-2 identified CRLF as a standard
Internet "wire" form of NLF, rather than just binding that form
to Windows.)
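
The ambiguity described above can be seen directly in the octets.
What follows is a minimal Python sketch, not anything proposed by the
draft: the helper name `describe` and its return strings are
hypothetical, while `codecs.BOM_UTF8` and `bytes.decode` are standard
library names.

```python
import codecs


def describe(data: bytes) -> str:
    """Say what can be learned about the encoding from the octets alone."""
    if data.startswith(codecs.BOM_UTF8):
        # A leading EF BB BF would mark the string as UTF-8.
        return "utf-8 (BOM)"
    if all(octet < 0x80 for octet in data):
        # Pure ASCII octets are identical in ASCII and in UTF-8.
        return "indistinguishable"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8 (non-ASCII)"


if __name__ == "__main__":
    print(describe(b"hello\r\n"))                  # indistinguishable
    print(describe(codecs.BOM_UTF8 + b"hello"))    # utf-8 (BOM)
    print(describe("caf\u00e9".encode("utf-8")))   # utf-8 (non-ASCII)
```

The first case is the one that matters here: with nothing but ASCII
characters and no BOM, the receiver cannot tell which form was sent.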

> My impression from reading the draft was exactly the opposite,
> FF not too bad, HT really bad, that's odd for protocols
> allowing WSP.

See above.

    john
