Kent,

I will try to address the comments that are essentially editorial after the Last Call closes, but you have raised a few points that have been discussed over and over again (not just in the net-utf8 context) and that I think are worth my responding to now (in addition to comments already made by others). FWIW, I'm working on several things right now that are of much higher priority to me than net-utf8, so comments from me are likely to come very slowly, especially since I think that authors should generally be quiet during Last Call and let other responses accumulate.

--On Monday, 07 January, 2008 22:30 +0100 Kent Karlsson <kent.karlsson14@xxxxxxxxx> wrote:

> To be "restrictive in what one emits and permissive/liberal in
> what one receives" might be applicable here.

What we have consistently found out about the robustness principle is that it is extremely useful if used as a tool for interpreting protocol requirements. Otherwise it is useful only in moderation. When we have explicit requirements that receivers accept garbage, we have repeatedly seen the combination of those requirements with the robustness principle used by senders to say "we can send any sort of garbage and you are required to clean it up". That does not promote either interoperability or a smoothly-running network. The question of why net-utf8 expects receivers to clean up normalization but does not expect them to tolerate aberrant line-endings is a reasonable one, for which see below. But I think your invocation of the robustness principle is inappropriate.

> Upon receipt, the following SHOULD be seen as at least line
> ending (or line separating), and in some cases more than that:
>
>     LF, CR+LF, VT, CR+VT, FF, CR+FF, CR (not followed by NUL...),
>     NEL, CR+NEL, LS, PS
> where
>     LF   U+000A
>     VT   U+000B
>     FF   U+000C
>     CR   U+000D
>     NEL  U+0085
>     LS   U+2028
>     PS   U+2029
>
> even FS, GS, RS
> where
>     FS   U+001C
>     GS   U+001D
>     RS   U+001E
> should be seen as line separating (Unicode specifies these as
> having bidi property B, which effectively means they are
> paragraph separating).

There are two theories about how to construct a protocol. One puts all responsibility on the sender; the other puts all responsibility on the receiver. A different manifestation of this is the difference between protocols with very few options, which are therefore supported in the same way by everyone, and protocols with many options that, in practice, are selectively supported and for which finding an interoperable client-server pair can be difficult. The "many options" and "receiver responsibility" approaches tend, if nothing else, to create N**2 problems, where every receiver has to understand every possible sender format, option, and combination of them, rather than dealing with a single form to which everything is converted on sending and, if necessary, de-converted on receipt. While one often needs to seek a balance in practice, I and many others believe that the Internet has been better served by the first option in each of these pairings, resulting in a minimum of variations on the wire. People such as Marshall Rose and Mike Padlipsky have been much more elegant on this subject than I can be, but the Internet's history of "just one form on the wire" for text types at the applications layer goes back to at least RFC 20 (which is one reason it was cited in the current draft).
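To make "one form on the wire" concrete, here is a minimal sender-side sketch, in Python, that folds the line-ending candidates in your list into CR LF before transmission. The function name and the exact character set are my own, purely for illustration; nothing like this appears in the draft:

    import re

    # CR followed by LF, VT, FF, or NEL counts as a single ending;
    # LF, VT, FF, NEL, LS, and PS also count when standing alone.
    _LINE_END = re.compile(
        "\r(?:[\n\x0b\x0c\x85])?|[\n\x0b\x0c\x85\u2028\u2029]")

    def to_wire(text: str) -> bytes:
        """Canonicalize every local line-ending form to CR LF and
        encode as UTF-8 before the text goes on the wire."""
        return _LINE_END.sub("\r\n", text).encode("utf-8")

The receiver then has exactly one form to parse, which is the point.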
There is also a security issue associated with this. When there is a single standard form, we know how to construct digital signatures over the text. When there are lots of things that are to be "treated as" the same, and suggestions that various systems along the line might appropriately convert one form into another, methods for computing digital signatures need their own canonicalization rules and rules about exactly when they are applied. That can be done, but, as I have suggested in several other places in this note, why go out of our way to make our lives more complex?

> Apart from CR+LF, these SHOULD NOT be emitted for net-utf8,
> unless that is overridden by the protocol specification (like
> allowing FF, or CR+FF). When faced with any of these in input
> **to be emitted as net-utf8**, each of these SHOULD be
> converted to a CR+LF (unless that is overridden by the
> protocol in question).

While I may not have succeeded (and, if I didn't, specific suggestions would be welcome), the net-utf8 draft was intended to be very specific that it didn't apply to protocols that didn't reference it, nor was its use mandatory for new protocols. That means that a protocol doesn't need to "override" it; it should just not reference it. Yes, I think it makes a new protocol harder to write if it doesn't reference net-utf8 than if it does. It may also generate "do you really need to do this" pushback against such protocols and against protocols that try "all of this _except_" references to this document. That is, IMO, as it should be. But, if other forms can be justified for particular applications, then they should be. It just makes no sense, at least to me, to include a lot of text in a spec whose purpose is to tell applications that don't use the spec what to do.

> --------------------------
>
> Section 2, point 3:
>
> You have made an exception for FF (because they occur in
> RFCs?).

We made an exception for FF --while cautioning against its use-- because it is permitted in NVT and appears fairly widely in text streams, and because some reasonable interpretations of its semantics are moderately well-understood. On the other hand, it comes with some cautions and, if there were consensus to remove the exception, I wouldn't personally hesitate to do that.

Part of the issue here is separate from the "don't mess with the other control characters" principle and the issue of line-endings. When ASCII came along, people were fairly optimistic about using plain text streams to specify formatting and page layout, partially because our expectations for those things were very low (one available font with no variations, always fixed-width and fixed-size). We figured out a rather long time ago that we needed markup and printer-specific controls to handle these things well (I believe the first instance in code may have been Jerry Saltzer's original RUNOFF in the first half of the 60s, but it has been a long time in any event). To the extent to which the output of those programs requires "control characters", rather than, e.g., the "terminal control" functions specified in ISO/IEC 6429 (more or less ANSI X3.64 or ECMA-48), that output is device-specific rather than a normal text stream. Today, we tend to use XML or other forms of markup on the input side and are normally pretty clear that the print-form output isn't a normal text stream of the type at which net-utf8 is targeted.

> I think FF SHOULD be avoided, just like VT, NEL, and
> more (see above).

I think the cautions about use of FF are just about that strong, but it does have significant current use (albeit not as an Internet text-stream line separator). One could put HT with it and explain that it should be used only when its interpretation (as a fixed number of spaces or a jump to a well-established column) is known but, because those things are rarely known, I/we made another choice; an illustration of the problem appears just below.
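To see the ambiguity in miniature (a toy example of mine, not anything from the draft): the same HT lands at different columns under two equally plausible tab settings, and nothing in the stream can say which was intended:

    # "id:" occupies columns 0-2; where the tab lands depends on a
    # tab-stop convention that the text stream cannot carry in-band.
    line = "id:\t42"
    print(repr(line.expandtabs(4)))  # 'id: 42'      (stops every 4)
    print(repr(line.expandtabs(8)))  # 'id:     42'  (stops every 8)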
> Even when it is allowed, it, and CR+FF,
> should be seen as line separating.

That has never been permitted in the protocols that reference, even implicitly, NVT. Why make things more complicated now by (i) introducing more flexibility for its own sake and putting more burden on receivers and (ii) giving bodies of text that use FF a different interpretation under this specification than they have under NVT? Deliberately introducing incompatibilities, it seems to me, requires much more justification than added flexibility.

> You have also (by implication) dismissed HT, U+0009. The
> reason for this is unclear. Especially since HT is so common
> in plain texts (often with some default tab setting). Mapping
> HT to SPs is often a bad idea. I don't think a default tab
> setting should be specified, but the effect of somewhat (not
> wildly) different defaults for that is not much worse than
> using variable width fonts.

But you have just summarized the reasons for avoiding HT. We don't have any standard that would give it unambiguous semantics. There is no way to incorporate tab column settings (in characters or millimeters) in a text stream, so one can't even disambiguate with an in-band option. That makes HT appropriate in marked-up text (which might or might not have better ways to specify what is wanted) or when options are being transmitted out of band, but not in running text streams. If there is consensus that this needs to be addressed more explicitly in the document, we can try to do so.

--On Wednesday, 09 January, 2008 23:30 +0100 Kent Karlsson <kent.karlsson14@xxxxxxxxx> wrote:

> B.t.w. many programs (long ago) had a bug that deleted the
> last line if it was not ended with a LF.

Not that long ago. I discovered, at a recent IETF meeting, a printer and printer driver that would drop the entire last page of a document, or drop the document entirely, if it didn't end in CR LF. I think the technical term for that is "nasty bug", not something that requires protocol changes.

> As an additional
> comment, I think that the Net-UTF-8 document should state
> that the last line need not be ended by CR+LF (or any other
> line end/separator), though it should be. This is just as a
> matter of normalising the line ends for Net-UTF8, not for
> UTF-8 in general.

So now End of Document (however that is expressed) is also a line-ending? Unfortunately, as we have discovered many times with email, an implied line-ending gets one into lots of trouble about just when it is implied and, in particular, whether digital signatures should be computed with the normal line-ending sequence inserted as implied or over the document as sent. Again, these problems are much more easily dealt with by specifying explicitly what is to be put on the wire, making the sending system convert things to that format as needed, and treating bugs as bugs rather than justification for making the standard forms more complex.
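The signature trouble is easy to see in miniature (illustrative only; I'm assuming SHA-256 as the digest, but any hash makes the same point). The digests of the document as sent and of the document with the implied final CR LF inserted simply differ, so signer and verifier must somehow agree, out of band, on which form was signed:

    import hashlib

    body = "last line, with no explicit ending"
    implied = body + "\r\n"   # the same text with the "implied" CR LF

    print(hashlib.sha256(body.encode("utf-8")).hexdigest())
    print(hashlib.sha256(implied.encode("utf-8")).hexdigest())
    # The two digests differ; a signature computed over one form
    # will not verify against the other.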
--On Thursday, 10 January, 2008 09:59 +0100 Kent Karlsson <kent.karlsson14@xxxxxxxxx> wrote:

> As for the receiving side the same considerations as for the
> (SHOULD) requirement (point numbered 4 on page 4) for NFC in
> Net-UTF-8 applies. The receiver cannot be sure that NFC has
> been applied. Nor can it be sure that conversion of all line
> endings to CR+LF (thereby losing information about their
> differences) has been applied.

This is, at least to me, a more interesting problem. On the one hand, there are no constraints due to backward compatibility with NVT. On the other, there are at least two real constraints:

(i) There is not a single normalization form. Four are standardized and others, for more or less specific purposes, are floating around (e.g., without getting tied up in terminology niceties about what is a normalization form and what is something else, nameprep uses, to a first order approximation, NFKC+lowercasing). There has never been a clear recommendation as to which one should be used globally (The Unicode Standard discusses considerations and tradeoffs... quite appropriately, IMO). In order to avoid chaos, some systems and packages force particular normalizations on whatever passes through them. (By contrast, I'm not aware of anything that goes out of its way to convert CRLF into NEL. From a Unicode standpoint, it would make more sense to convert CRLF to U+2028 (which, strangely to me, doesn't appear on your list above) but, again, AFAIK, no one does that as a matter of routine either.) The net result of this is that, if we have a string that starts out in some normalization form (even NFC) and is then passed across the network, it may end up in the hands of the receiving subsystem in, e.g., NFD. So it is important, pragmatically and whether we like it or not, that the receiver check or apply normalization regardless of what requirements we make on the sender.

(ii) The digital signature issues are similar: if one wants two bodies of text that are considered equivalent to have the same signature value, normalization is required to get even near equivalency.

Put differently, treating a body of text that is unnormalized on receipt as a bug to be rejected just doesn't make practical sense, while treating text strewn with assorted characters that might be line-ends (or, in some cases, might be something else) as a bug does.

thanks for the comments, thoughts, and careful reading.

    john
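p.s. In case it is useful: the receiver-side behavior I'm arguing for above is nearly trivial to implement. A sketch in Python (standard library only; the function name is mine):

    import unicodedata

    def on_receipt(text: str) -> str:
        # Apply NFC rather than rejecting unnormalized input; some
        # system in transit may have (re)normalized it differently.
        return unicodedata.normalize("NFC", text)

    # "e" followed by U+0301 (combining acute) arrives decomposed
    # (NFD) and comes out precomposed (NFC).
    assert on_receipt("cafe\u0301") == "caf\u00e9"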