Frank Ellermann wrote:
> John C Klensin wrote:
>
> > It is ambiguous for HT.
>
> Yes, but we typically don't care about this in protocols as
> long as it behaves like one or more spaces. I think that's

Well, they don't exactly behave like a sequence of spaces.
(See below.)

> the idea of "WSP = SP / HTAB ; white space" in RFC 4234bis,
> waiting for its STD number.
>
> We talked about the 4234bis issue of "trailing white space",
> which could cause havoc when it is silently removed, and a
> "really empty line" is not the same as an "apparently empty
> line" (i.e. CRLF CRLF vs. CRLF 1*WSP CRLF).
>
> A similar robustness principle would support to accept old
> "HTAB-compression" or "HTAB-beautification" (e.g. as first

Doing what you call 'old "HTAB-compression"' is a bad idea,
for several reasons (that I don't detail here, but for one:
see below).

> character in a folded line). In other words WSP, not only
> SP. It is clear that the outcome is ambiguous, but in some
> protocols I care about (headers in MIME, mail, and news)
> *WSP or 1*WSP are acceptable. Admittedly it is a pain when
> signatures need white space canonicalization. But replacing
> *WSP by *SP would only simplify this step, not get rid of it.
>
> [About CRLF]
> > Unicode 5.0, Section 5.8, provides significant insight into
> > the complexity of this problem and probably should have
> > been referenced. It would be even more helpful had Table
> > 5-2 included identifying CRLF as a standard Internet "wire"
> > form of NLF, not just binding that form to Windows.
>
> Indeed, this chapter offers significantly *broken* insight
> for our purposes. What they found was a horrible mess, then

Hmm, a minor mess, but not that horrible.

> they introduced wannabe-unambiguous LS + PS, and what they
> arrived at was messier than before. Claiming that CRLF is

They were introduced in Unicode 1.1, long before the text for
section 5.8 was drafted (originally as UTS 10).
One important point that you have missed is that LS and PS, and
the difference between THEM, are essential to the bidi algorithm.
What is or may be done with other NLFs is basically a hack (most
NLF are treated as if they were PS). Note also that two LSes in
sequence don't make a PS...

Assume for the moment that we were using only LS (as John has
suggested as a possible ideal). This would imply that the bidi
algorithm would consider the ENTIRE document, however many
thousands of pages it may span, as a SINGLE paragraph. Thus, if
there is any bidi processing, none of the text can be displayed
until the entire text has been read in and bidi levelled as a
whole, etc. That may have some display effects (bidi controls,
LRE, RLE, LRO, RLO, span at most to the end of a paragraph, so
if paragraph ends are replaced by LS, the bidi controls may span
more text than they did originally).

The hack is to regard all NLFs except LS, VT, and FF as PS.
Since CRLF (say) may occur inside of what is actually a
paragraph, this has some display effects (limiting bidi controls
range more than they were originally), but at least bidi
processing can be done piece by piece of the text.

Bidi control codes are not talked about in the document we are
discussing...

> "windows" is odd for DOS + OS/2 users, it is also at odds
> with numerous Internet standards - precisely the reason why
> we need your draft.
>
> The chapter talks about line and paragraph separators without
> mentioning relevant ASCII controls such as RS. On the other

RS (and GS and FS) are regarded the same as PS for bidi
processing, even though they are not mentioned in section 5.8.
But I would agree that using RS, GS, FS, or (even worse) IND
would be aberrant. US is regarded as similar to a HT for bidi
processing (so should HTJ, but isn't by default).

Note that HT is NOT treated the same as a sequence of spaces for
bidi processing. HT always has the paragraph bidi level, which
is not necessarily the case for spaces.
This DOES affect display, in that HT always "moves" according to
the paragraph level, while spaces may (and often do) move
opposite to the paragraph level. So **DON'T** imply that HT
should be replaced by spaces; such a replacement WILL have ill
display effects.

However, I do think that these four characters should not be
used.

> hand it mentions MS Word interna which are nobody's business
> outside of MS Word.

I guess you refer to VT (LINE TABULATION). The real reason it is
mentioned is that the C and C++ standards give a special escape
for it (\v, which according to C implies a return to beginning
of line). If it were not for that, I would agree that VT is not
very interesting (though it does provide for a hack to
distinguish line separation from paragraph separation by
ignoring the "tabulation" aspect of VT, also for pure 8-bit
character encodings).

        /Kent Karlsson

> It is interesting, but IMO unusable for net-utf8.
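
The bidi classes referred to above can be looked up in the
Unicode Character Database. What follows is only a minimal
sketch, assuming Python's standard unicodedata module; the names
CHARS, bidi_class and bidi_paragraphs are hypothetical helpers
made up for illustration, and keeping CR LF together as one
separator is an assumption of this sketch. It lists the bidi
class of each control discussed in this thread and splits a text
into bidi paragraphs after each class-B character, which shows
why a text that uses only LS ends up as a single paragraph.

```python
import unicodedata

# Controls discussed in the thread, keyed by their usual abbreviations.
CHARS = {
    "HT": "\u0009", "LF": "\u000A", "VT": "\u000B", "FF": "\u000C",
    "CR": "\u000D", "FS": "\u001C", "GS": "\u001D", "RS": "\u001E",
    "US": "\u001F", "SP": "\u0020", "NEL": "\u0085",
    "LS": "\u2028", "PS": "\u2029",
}


def bidi_class(name):
    """Return the Bidi_Class (e.g. 'B', 'S', 'WS') of a named control."""
    return unicodedata.bidirectional(CHARS[name])


def bidi_paragraphs(text):
    """Split text after every character of bidi class B.

    The separator stays with the paragraph it ends. Assumption of this
    sketch: a CR immediately followed by LF counts as one separator.
    """
    paragraphs, start, i = [], 0, 0
    while i < len(text):
        if unicodedata.bidirectional(text[i]) == "B":
            if text[i] == "\r" and text[i + 1:i + 2] == "\n":
                i += 1  # keep the LF with its CR
            paragraphs.append(text[start:i + 1])
            start = i + 1
        i += 1
    if start < len(text):
        paragraphs.append(text[start:])  # trailing text without separator
    return paragraphs


if __name__ == "__main__":
    # Print one line per control: abbreviation, code point, bidi class.
    for name, ch in CHARS.items():
        print(f"{name:>3}  U+{ord(ch):04X}  {bidi_class(name)}")
```

In this sketch HT and SP come out with different classes, so
replacing one by the other changes the input to the bidi
algorithm, and a text whose only line breaks are LS is returned
as one paragraph.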