Kent,

I will try to address the comments that are essentially editorial after the Last Call closes, but you have raised a few points that have been discussed over and over again (not just in the net-utf8 context) and that I think are worth my responding to now (in addition to comments already made by others). FWIW, I'm working on several things right now that are of much higher priority to me than net-utf8, so comments from me are likely to come very slowly, especially since I think that authors should generally be quiet during Last Call and let other responses accumulate.

--On Monday, 07 January, 2008 22:30 +0100 Kent Karlsson <kent.karlsson14@xxxxxxxxx> wrote:

> To be "restrictive in what one emits and permissive/liberal in
> what one receives" might be applicable here.

What we have consistently found out about the robustness principle is that it is extremely useful if used as a tool for interpreting protocol requirements. Otherwise it is useful only in moderation. When we have explicit requirements that receivers accept garbage, we have repeatedly seen the combination of those requirements with the robustness principle used by senders to say "we can send any sort of garbage and you are required to clean it up". That does not promote either interoperability or a smoothly-running network. The question of why net-utf8 expects receivers to clean up normalization but does not expect them to tolerate aberrant line-endings is a reasonable one, for which see below. But I think your invocation of the robustness principle is inappropriate.

> Upon receipt, the following SHOULD be seen as at least line
> ending (or line separating), and in some cases more than that:
>
>     LF, CR+LF, VT, CR+VT, FF, CR+FF, CR (not followed by NUL...),
>     NEL, CR+NEL, LS, PS
> where
>     LF   U+000A
>     VT   U+000B
>     FF   U+000C
>     CR   U+000D
>     NEL  U+0085
>     LS   U+2028
>     PS   U+2029
>
> even FS, GS, RS
> where
>     FS   U+001C
>     GS   U+001D
>     RS   U+001E
> should be seen as line separating (Unicode specifies these as
> having bidi property B, which effectively means they are
> paragraph separating).

There are two theories about how to construct a protocol. One puts all responsibility on the sender; the other puts all responsibility on the receiver. A different manifestation of this is the difference between protocols with very few options, which are therefore supported in the same way by everyone, and protocols with many options that, in practice, are selectively supported and for which finding an interoperable client-server pair can be difficult. The "many options" and "receiver responsibility" approaches tend, if nothing else, to create N**2 problems, where every receiver has to understand every possible sender format, option, and combination of them, rather than dealing with a single form to which everything is converted on sending and, if necessary, de-converted on receipt. While one often needs to seek a balance in practice, I and many others believe that the Internet has been better served by the first option in each of these pairings, resulting in a minimum of variations on the wire. People such as Marshall Rose and Mike Padlipsky have been much more elegant on this subject than I can be, but the Internet's history of "just one form on the wire" for text types at the applications layer goes back to at least RFC 20 (which is one reason it was cited in the current draft).
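To make "one form on the wire" concrete, here is a minimal sender-side sketch, in Python, that folds the line-ending candidates in your list into CR LF before transmission. The function name and the exact character set are my own, purely for illustration; nothing like this appears in the draft:

    import re

    # CR followed by LF, VT, FF, or NEL counts as a single ending;
    # LF, VT, FF, NEL, LS, and PS also count when standing alone.
    _LINE_END = re.compile(
        "\r(?:[\n\x0b\x0c\x85])?|[\n\x0b\x0c\x85\u2028\u2029]")

    def to_wire(text: str) -> bytes:
        """Canonicalize every local line-ending form to CR LF and
        encode as UTF-8 before the text goes on the wire."""
        return _LINE_END.sub("\r\n", text).encode("utf-8")

The receiver then has exactly one form to parse, which is the point.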
There is also a security issue associated with this. When there is a single standard form, we know how to construct digital signatures over the text. When there are lots of things that are to be "treated as" the same, and suggestions that various systems along the line might appropriately convert one form into another, methods for computing digital signatures need their own canonicalization rules and rules about exactly when they are applied. That can be done, but, as I have suggested in several other places in this note, why go out of our way to make our lives more complex?

> Apart from CR+LF, these SHOULD NOT be emitted for net-utf8,
> unless that is overridden by the protocol specification (like
> allowing FF, or CR+FF). When faced with any of these in input
> **to be emitted as net-utf8**, each of these SHOULD be
> converted to a CR+LF (unless that is overridden by the
> protocol in question).

While I may not have succeeded (and, if I didn't, specific suggestions would be welcome), the net-utf8 draft was intended to be very specific that it didn't apply to protocols that didn't reference it, nor was its use mandatory for new protocols. That means that a protocol doesn't need to "override" it; it should just not reference it. Yes, I think it makes a new protocol harder to write if it doesn't reference net-utf8 than if it does. It may also generate "do you really need to do this" pushback against such protocols and against protocols that try "all of this _except_" references to this document. That is, IMO, as it should be. But, if other forms can be justified for particular applications, then they should be. It just makes no sense, at least to me, to include a lot of text in a spec whose purpose is to tell applications that don't use the spec what to do.

> --------------------------
>
> Section 2, point 3:
>
> You have made an exception for FF (because they occur in
> RFCs?).

We made an exception for FF --while cautioning against its use-- because it is permitted in NVT and appears fairly widely in text streams, and because some reasonable interpretations of its semantics are moderately well-understood. On the other hand, it comes with some cautions and, if there were consensus to remove the exception, I wouldn't personally hesitate to do that.

Part of the issue here is separate from the "don't mess with the other control characters" principle and the issue of line-endings. When ASCII came along, people were fairly optimistic about using plain text streams to specify formatting and page layout, partially because our expectations for those things were very low (one available font with no variations, always fixed-width and fixed-size). We figured out a rather long time ago that we needed markup and printer-specific controls to handle these things well (I believe the first instance in code may have been Jerry Saltzer's original RUNOFF in the first half of the 60s, but it has been a long time in any event). To the extent to which the output of those programs requires "control characters", rather than, e.g., the "terminal control" functions specified in ISO/IEC 6429 (more or less ANSI X3.64 or ECMA-48), that output is device-specific rather than a normal text stream. Today, we tend to use XML or other forms of markup on the input side and are normally pretty clear that the print-form output isn't a normal text stream of the type at which net-utf8 is targeted.

> I think FF SHOULD be avoided, just like VT, NEL, and
> more (see above).

I think the cautions about use of FF are just about that strong, but it does have significant current use (albeit not as an Internet text-stream line separator). One could put HT with it and explain that it should be used only when its interpretation (as a fixed number of spaces or a jump to a well-established column) is known but, because those things are rarely known, I/we made another choice; an illustration of the problem appears just below.
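To see the ambiguity in miniature (a toy example of mine, not anything from the draft): the same HT lands at different columns under two equally plausible tab settings, and nothing in the stream can say which was intended:

    # "id:" occupies columns 0-2; where the tab lands depends on a
    # tab-stop convention that the text stream cannot carry in-band.
    line = "id:\t42"
    print(repr(line.expandtabs(4)))  # 'id: 42'      (stops every 4)
    print(repr(line.expandtabs(8)))  # 'id:     42'  (stops every 8)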
> Even when it is allowed, it, and CR+FF,
> should be seen as line separating.

That has never been permitted in the protocols that reference, even implicitly, NVT. Why make things more complicated now by (i) introducing more flexibility for its own sake and putting more burden on receivers and (ii) giving bodies of text that use FF a different interpretation under this specification than they have under NVT? Deliberately introducing incompatibilities, it seems to me, requires much more justification than added flexibility.

> You have also (by implication) dismissed HT, U+0009. The
> reason for this is unclear. Especially since HT is so common
> in plain texts (often with some default tab setting). Mapping
> HT to SPs is often a bad idea. I don't think a default tab
> setting should be specified, but the effect of somewhat (not
> wildly) different defaults for that is not much worse than
> using variable width fonts.

But you have just summarized the reasons for avoiding HT. We don't have any standard that would give it unambiguous semantics. There is no way to incorporate tab column settings (in characters or millimeters) in a text stream, so one can't even disambiguate with an in-band option. That makes HT appropriate in marked-up text (which might or might not have better ways to specify what is wanted) or when options are being transmitted out of band, but not in running text streams. If there is consensus that this needs to be addressed more explicitly in the document, we can try to do so.

--On Wednesday, 09 January, 2008 23:30 +0100 Kent Karlsson <kent.karlsson14@xxxxxxxxx> wrote:

> B.t.w. many programs (long ago) had a bug that deleted the
> last line if it was not ended with a LF.

Not that long ago. I discovered, at a recent IETF meeting, a printer and printer driver that would drop the entire last page of a document, or drop the document entirely, if it didn't end in CR LF. I think the technical term for that is "nasty bug", not something that requires protocol changes.

> As an additional
> comment, I think that the Net-UTF-8 document should state
> that the last line need not be ended by CR+LF (or any other
> line end/separator), though it should be. This is just as a
> matter of normalising the line ends for Net-UTF8, not for
> UTF-8 in general.

So now End of Document (however that is expressed) is also a line-ending? Unfortunately, as we have discovered many times with email, an implied line-ending gets one into lots of trouble about just when it is implied and, in particular, whether digital signatures should be computed with the normal line-ending sequence inserted as implied or over the document as sent. Again, these problems are much more easily dealt with by specifying explicitly what is to be put on the wire, making the sending system convert things to that format as needed, and treating bugs as bugs rather than justification for making the standard forms more complex.
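The signature trouble is easy to see in miniature (illustrative only; I'm assuming SHA-256 as the digest, but any hash makes the same point). The digests of the document as sent and of the document with the implied final CR LF inserted simply differ, so signer and verifier must somehow agree, out of band, on which form was signed:

    import hashlib

    body = "last line, with no explicit ending"
    implied = body + "\r\n"   # the same text with the "implied" CR LF

    print(hashlib.sha256(body.encode("utf-8")).hexdigest())
    print(hashlib.sha256(implied.encode("utf-8")).hexdigest())
    # The two digests differ; a signature computed over one form
    # will not verify against the other.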
--On Thursday, 10 January, 2008 09:59 +0100 Kent Karlsson <kent.karlsson14@xxxxxxxxx> wrote:

> As for the receiving side the same considerations as for the
> (SHOULD) requirement (point numbered 4 on page 4) for NFC in
> Net-UTF-8 applies. The receiver cannot be sure that NFC has
> been applied. Nor can it be sure that conversion of all line
> endings to CR+LF (thereby losing information about their
> differences) has been applied.

This is, at least to me, a more interesting problem. On the one hand, there are no constraints due to backward compatibility with NVT. On the other, there are at least two real constraints:

(i) There is not a single normalization form. Four are standardized and others, for more or less specific purposes, are floating around (e.g., without getting tied up in terminology niceties about what is a normalization form and what is something else, nameprep uses, to a first order approximation, NFKC+lowercasing). There has never been a clear recommendation as to which one should be used globally (The Unicode Standard discusses considerations and tradeoffs... quite appropriately, IMO). In order to avoid chaos, some systems and packages force particular normalizations on whatever passes through them. (By contrast, I'm not aware of anything that goes out of its way to convert CRLF into NEL. From a Unicode standpoint, it would make more sense to convert CRLF to U+2028 (which, strangely to me, doesn't appear on your list above) but, again, AFAIK, no one does that as a matter of routine either.) The net result of this is that, if we have a string that starts out in some normalization form (even NFC) and is then passed across the network, it may end up in the hands of the receiving subsystem in, e.g., NFD. So it is important, pragmatically and whether we like it or not, that the receiver check or apply normalization regardless of what requirements we make on the sender.

(ii) The digital signature issues are similar: if one wants two bodies of text that are considered equivalent to have the same signature value, normalization is required to get even near equivalency.

Put differently, treating a body of text that is unnormalized on receipt as a bug to be rejected just doesn't make practical sense, while treating text strewn with assorted characters that might be line-ends (or, in some cases, might be something else) as a bug does.

thanks for the comments, thoughts, and careful reading.

    john
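p.s. In case it is useful: the receiver-side behavior I'm arguing for above is nearly trivial to implement. A sketch in Python (standard library only; the function name is mine):

    import unicodedata

    def on_receipt(text: str) -> str:
        # Apply NFC rather than rejecting unnormalized input; some
        # system in transit may have (re)normalized it differently.
        return unicodedata.normalize("NFC", text)

    # "e" followed by U+0301 (combining acute) arrives decomposed
    # (NFD) and comes out precomposed (NFC).
    assert on_receipt("cafe\u0301") == "caf\u00e9"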