RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

John C Klensin wrote:

> Kent,
> 
> I will try to address the comments that are essentially
> editorial after the Last Call closes, but you have raised a few
> points that have been discussed over and over again (not just in

I raised a number of non-editorial issues that you did not
address below...

> the net-utf8 context) and that I think are worth my responding
> to now (in addition to comments already made by others).  FWIW,
> I'm working on several things right now that are of much higher
> priority to me than net-utf8, so, especially since I think that
> authors should generally be quiet during Last Call and let other
> responses accumulate, are likely to cause comments from me to
> come very slowly.

I think this document is at least one draft, maybe two or three
drafts, away from having sufficient clarity and quality to become a
standards document. In addition, you state that you don't have time
right now to deal with this. I would therefore suggest that the
document be withdrawn from Last Call, to allow time for cleaning it
up.
 

> > To be "rescrictive in what one emits and permissive/liberal in
> > what one receives" might be applicable here.
> 
> What we have consistently found out about the robustness
> principle is that it is extremely useful if used as a tool for
> interpreting protocol requirements. Otherwise it is useful only
> in moderation.  When we have explicit requirements that
> receivers accept garbage, we have repeatedly seen the
> combination of those requirements with the robustness principle
> used by senders to say "we can send any sort of garbage and you
> are required to clean it up".   That does not promote either
> interoperability or a smoothly-running network.
> 
> The question of why net-utf8 expects receivers to clean up
> normalization but does not expect them to tolerate aberrant
> line-endings is a reasonable one, for which see below.  But I

I would not refer to most other line separators (that is how they
are best seen) as in any way "aberrant". Except for RS, GS, FS and
IND, they are not aberrant, and they are in no way malformed or
irregular. I agree that the situation is not ideal, but I do think
it is perfectly manageable without undue effort.

I do, however, regard using bare CR or BS to achieve accenting or
underlining as highly aberrant, malformed, irregular and
unmanageable. (But that is the aberrant zoo you want to keep...)

> think your invocation of the robustness principle is
> inappropriate.

I'm not sure why...

> > Upon receipt, the following SHOULD be seen as at least line
> > ending (or line separating), and in some cases more than that: 
> > 
> > LF, CR+LF, VT, CR+VT, FF, CR+FF, CR (not followed by NUL...),
> > NEL, CR+NEL, LS, PS
> > where
> > LF	U+000A
> > VT	U+000B
> > FF	U+000C
> > CR	U+000D
> > NEL	U+0085
> > LS	U+2028
> > PS	U+2029
> > 
> > even FS, GS, RS
> > where
> > FS	U+001C
> > GS	U+001D
> > RS	U+001E
> > should be seen as line separating (Unicode specifies these as
> > having bidi property B, which effectively means they are
> > paragraph separating).
> 

> There is also a security issue associated with this.  When there
> is a single standard form, we know how to construct digital
> signatures over the text.   When there are lots of things that
> are to be "treated as" the same, and suggestions that various
> systems along the line might appropriately convert one form into
> another, methods for computing digital signatures need their own
> canonicalization rules and rules about exactly when they are
> applied.  That can be done, but, as I have suggested in several
> other places in this note, why go out of our way to make our
> lives more complex?

You already have NFC as a SHOULD, not a SHALL, which makes your
argument here entirely moot.
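
Precisely because NFC is only a SHOULD, a receiver that cares about
stable digital signatures has to check or apply normalisation itself
anyway. A minimal Python sketch of what that implies (the function
name is mine, not from the draft):

    import unicodedata

    def ensure_nfc(text):
        # Sketch: normalise received text to NFC before hashing or
        # verifying a signature over it, since NFC cannot be assumed.
        if unicodedata.is_normalized("NFC", text):
            return text          # already NFC, nothing to do
        return unicodedata.normalize("NFC", text)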

> > Apart from CR+LF, these SHOULD NOT be emitted for net-utf8,
> > unless that is overriden by the protocol specification (like
> > allowing FF, or CR+FF). When faced with any of these in input
> > **to be emitted as net-utf8**, each of these SHOULD be
> > converted to a CR+LF (unless that is overridden by the
> > protocol in question).
> 
> While I may not have succeeded (and, if I didn't, specific
> suggestions would be welcome), the net-utf8 draft was intended
> to be very specific that it didn't apply to protocols that
> didn't reference it, nor was its use mandatory for new
> protocols.

Agreed. But that is not related to any of my comments.

>  That means that a protocol doesn't need to
> "override" it; it should just not reference it.  Yes, I think it
> makes a new protocol harder to write if it doesn't reference
> net-utf8 than if it does.  It may also generate "do you really
> need to do this" pushback against such protocols and against
> protocols that try "all of this _except_" references to this
> document.  That is, IMO, as it should be.  But, if other forms
> can be justified for particular applications, then they should
> be.   

I think it is perfectly reasonable for a protocol to define a
"profile" of Net-UTF-8; e.g., saying "[use Net-UTF-8] except that
FF and CR+FF are allowed, and that FF is converted to CR+FF [while
normally those would have been converted to CR+LF]". Note: that
was only an example; a sketch of such a conversion follows below.
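
To make the intent concrete, here is a rough Python sketch (mine,
not from the draft) of a sender mapping the main separators listed
earlier (LF, CR+LF, VT, FF, CR, NEL, LS, PS and the CR+X pairs) to
CR+LF, with an optional FF exception of the kind such a hypothetical
profile could allow:

    import re

    # Sketch only: map the listed line separators to CR+LF before
    # emitting text as Net-UTF-8.  The keep_ff flag illustrates the
    # hypothetical "FF is allowed" profile, not the draft itself.
    LINE_SEPS = re.compile(
        "\r\n|\r\x85|\r\x0b|\r\x0c|[\n\x0b\x0c\r\x85\u2028\u2029]")

    def to_crlf(text, keep_ff=False):
        def repl(m):
            if keep_ff and m.group(0) in ("\x0c", "\r\x0c"):
                return "\r\x0c"   # profile exception: FF / CR+FF kept as CR+FF
            return "\r\n"         # everything else becomes CR+LF
        return LINE_SEPS.sub(repl, text)

So to_crlf("a\x0cb") yields "a\r\nb", while to_crlf("a\x0cb",
keep_ff=True) keeps the page break as CR+FF.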

> It just makes no sense, at least to me, to include a lot of text
> in a spec whose purpose is to tell applications that don't use
> the spec what to do.

Of course not. I did not say so either.

> > You have made an exception for FF (because they occur in
> > RFCs?).
> 
> We made an exception for FF --while cautioning against its use--
> because it is permitted in NVT and fairly widely in text streams
> and because some reasonable interpretation of its semantics is
> moderately well-understood.   On the other hand, it comes with
> some cautions and, if there were consensus to remove the
> exception, I wouldn't personally hesitate to do that.

See my original comment, and above, for how I think this should be
resolved.

> > I think FF SHOULD be avoided, just like VT, NEL, and
> > more (see above).
> 
> I think the cautions about use of FF are just about that strong,
> but it does have significant current use (albeit not an Internet
> text-stream line separator).   One could put HT with it and
> explain that it should be used only when its interpretation (as
> a fixed number of spaces or a jump to a well-established column)
> is known, but, because those are rarely known, I/we made another
> choice.
> 
> > Even when it is allowed, it, and CR+FF,
> > should be seen as line separating.
> 
> That has never been permitted in the protocols that reference,
> even implicitly, NVT.  Why make things more complicated now by
> (i) introducing more flexibility for its own sake and putting
> more burden on receivers and (ii) giving bodies of text that use
> FF a different interpretation under this specification than they

FF is always line separating (though, fortunately, it has rarely
been used): if you change page, the line preceding the page break
is of course ended (though the paragraph need not end). Even when
interpreted as an "empty line", it is line separating (as it should
be).

> have under NVT.  Deliberately introducing incompatibilities, it
> seems to me, requires much more justification than  added
> flexibility.
> 
> > You have also (by implication) dismissed HT, U+0009. The
> > reason for this is unclear. Especially since HT is so common
> > in plain texts (often with some default tab setting). Mapping
> > HT to SPs is often a bad idea. I don't think a default tab
> > setting should be specified, but the effect of somewhat (not
> > wildly) different defaults for that is not much worse than
> > using variable width fonts.
> 
> But you have just summarized the reasons for avoiding HT.  We

No, I have not. You tried to give some arguments, but I'm far from
persuaded. (Nor, it seems, is Frank Ellerman.) In practice, unless
the default tab setting is really wacko, which it usually isn't, the
"problems" are not worse than those of using variable-width fonts.
And variable-width fonts are just about an absolute necessity when
going beyond Latin/Greek/Cyrillic, and are also commonly used for
Latin/Greek/Cyrillic, however detrimental they are for "ASCII art".

> don't have any standard that would give it unambiguous
> semantics.  There is no way to incorporate tab column settings
> (in characters or millimeters) in a text stream, so one can't

Yes, there is:

0088;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION SET;;;;

**NOT** that I suggest using that! Definitely not! I'm just pointing
out that there **is** an already defined control code for setting
tab stops, which, however, has clear disadvantages and is outdated.

> even disambiguate with an in-band option.  That makes HT
> appropriate in marked-up text (which might or might not have
> better ways to specify what is wanted) or when options are being
> transmitted out of band, but not in running text streams.   If
> there is consensus that this needs to be addressed more
> explicitly in the document, we can try to do so.

I (and apparently at least also Frank Ellerman) think that HT
should be allowed in Net-UTF-8. The default tab-stop settings for
plain text seem to work well.

You seem to be silently suggesting that an original HT should be
replaced by one or more spaces. But how many spaces in each
instance? I think it would be better to keep the HT (which I agree
is not an ideal character, but it is very common) as is. Note that
I do NOT suggest ever replacing spaces with HT. Doing that would be
a really bad idea (but it is still seen sometimes, with ill effects;
like for the subject line I "got" for this message...).
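
To illustrate the "how many spaces" point: the replacement depends
on the column at which the HT occurs, so no fixed substitution
preserves the layout. A small Python sketch (mine, assuming the
common default of tab stops every 8 columns):

    def expand_tabs(line, tab_width=8):
        # Sketch: each HT advances to the next multiple of tab_width,
        # so the number of spaces it "is" varies with its position.
        out, col = [], 0
        for ch in line:
            if ch == "\t":
                pad = tab_width - (col % tab_width)
                out.append(" " * pad)
                col += pad
            else:
                out.append(ch)
                col += 1
        return "".join(out)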

> > B.t.w. many programs (long ago) had a bug that deleted the
> > last line if it was not ended with a LF.
> 
> Not that long ago.  I discovered, at a recent IETF meeting, a
> printer and printer driver that would drop the entire last page
> of a document, or drop the document entirely, if it didn't end
> in CR LF.  I think the technical term for that is "nasty bug",
> not something that requires protocol changes.

As the current Net-UTF-8 draft is written, that printer behaviour
seems entirely within what is permissible (for printing, say,
Net-UTF-8 plain text documents).

What do you expect to happen if line separators other than CR+LF
are used? Rejection of the text, error messages, ignoring/deleting
them, treating them as spaces, or what?


> > As an additional
> > comment, I think that the Net-UTF-8 document should state
> > that the last line need not be ended by CR+LF (or any other
> > line end/separator), though it should be. This is just as a
> > matter of normalising the line ends for Net-UTF8, not for
> > UTF-8 in general.
> 
> So now End of Document (however that is expressed) is also a
> line-ending?  

Unless the text piece is used as a fragment (to be inserted into or
appended to something else), end-of-document without an explicit
line end should end the (last) line, rather than be an error.

(And if there is a page paradigm, end-of-document (for a "complete"
document, i.e. not used as a fragment) also implies end-of-last-page,
even if there is no FF at the end.)
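
A minimal sketch (my naming) of how a sender could honour that
without burdening the receiver: terminate the last line of a
complete document before emitting it, and on receipt simply treat a
missing final CR+LF as ending the last line rather than as an error.

    def terminate_final_line(text):
        # Sketch: when emitting a complete document (not a fragment),
        # make sure the last line ends with CR+LF.
        if text and not text.endswith("\r\n"):
            return text + "\r\n"
        return text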

> Unfortunately, as we have discovered many times
> with email, an implied line-ending gets one into lots of trouble
> about just when it is implied and, in particular, whether
> digital signatures should be computed with the normal
> line-ending sequence inserted as implied or over the document as
> sent.   Again, these problems are much more easily dealt with by
> specifying explicitly what is to be put on the wire, making the
> sending system convert things to that format as needed, and
> treating bugs as bugs rather than justification for making the
> standard forms more complex.
> 
> > 
> > As for the receiving side the same considerations as for the
> > (SHOULD) requirement (point numbered 4 on page 4) for NFC in
> > Net-UTF-8 applies. The reciever cannot be sure that NFC has
> > been applied. Nor can it be sure that conversion of all line
> > endings to CR+LF (there-by loosing information about their
> > differences) has been applied.
> 
> This is, at least to me, a more interesting problem.  On the one
> hand, there are no constraints due to backward compatibility
> with NVT.  On the other, there are at least two real constraints:
> 
> (i) There is not a single normalization form.  Four are
> standardized and others, for more or less specific purposes, are
> floating around (e.g., without getting tied up in terminology
> niceties about what is a normalization form and what is
> something else, nameprep uses, to a first order approximation,
> NFKC+lowercasing).  There has never been a clear recommendation
> as to which one should be used globally (The Unicode Standard
> discusses considerations and tradeoffs... quite appropriately,
> IMO).  In order to avoid chaos, some systems and packages force
> particular normalizations on whatever passes through them (by
> contrast, I'm not aware of anything that goes out of its way to
> convert CRLF into NEL.  From a Unicode standpoint, it would make

Conversion to EBCDIC usually converts line endings (like CRLF) to
NEL. It is not absolutely certain that NEL is converted to another
line ending/separator upon conversion to a non-EBCDIC encoding.

Unicode normalisation being applied in transit is also, in principle
and so far, much less likely than line-ending conversion.

> more sense to convert CRLF to U+2028 (which, strangely to me,
> doesn't appear on your list above) but, again, AFAIK, no one does
> that as a matter of routine either).   The net result of this is

U+2028 does occur in my list above. IIRC it is used as the "native"
line separator in at least one system (SymbianOS). Some programs on
other systems can also save files using LS as the line separator.

> that, if we have a string that starts out in some normalization
> form (even NFC) that is then passed across the network, it may
> then end up in the hands of the receiving subsystem in, e.g.,
> NFD.  So it is important, pragmatically and whether we like it
> or not, that the receiver check or apply normalization
> regardless of what requirements we make on the sender.  The
> digital signature issues are similar -- if one wants two bodies
> of text to have the same signature value if they are considered
> equivalent requires normalization to get even near equivalency.
> Put differently, treating a body of text that is  unnormalized
> on receipt as a bug to be rejected just doesn't make practical
> sense, while treating text strewn with assorted characters that
> might be line-ends (or, in some cases, might be something else)
> doesn't.
> 
> thanks for the comments, thoughts, and careful reading.

As I mentioned, I think the document we are discussing needs a few
more drafts before it is in good enough shape to be reissued for
Last Call.


	/Kent Karlsson

