RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

"Karlsson, Kent" <kent.karlsson@xxxxxxxxxxxxxxx> · Mon, 14 Jan 2008 11:08:37 +0100

Frank Ellermann wrote:

> John C Klensin wrote:
> 
> > It is ambiguous for HT.
> 
> Yes, but we typically don't care about this in protocols as
> long as it behaves like one or more spaces.  I think that's

Well, an HT doesn't exactly behave like a sequence of spaces.
(See below.)

> the idea of "WSP = SP / HTAB ; white space" in RFC 4234bis,
> waiting for its STD number.
> 
> We talked about the 4234bis issue of "trailing white space",
> which could cause havoc when it is silently removed, and a
> "really empty line" is not the same as an "apparently empty
> line" (i.e. CRLF CRLF vs. CRLF 1*WSP CRLF).
> 
> A similar robustness principle would support to accept old
> "HTAB-compression" or "HTAB-beautification" (e.g. as first

Doing what you call 'old "HTAB-compression"' is a bad idea,
for several reasons (which I won't detail here, but for one,
see below).

> character in a folded line).  In other words WSP, not only
> SP.  It is clear that the outcome is ambiguous, but in some
> protocols I care about (headers in MIME, mail, and news)
> *WSP or 1*WSP are acceptable.   Admittedly it is a pain when
> signatures need white space canonicalization.  But replacing
> *WSP by *SP would only simplify this step, not get rid of it.
> 
>  [About CRLF]
> > Unicode 5.0, Section 5.8, provides significant insight into
> > the complexity of this problem and probably should have
> > been referenced.  It would be even more helpful had Table
> > 5-2 included identifying CRLF as a standard Internet "wire"
> > form of NLF, not just binding that form to Windows.
> 
> Indeed, this chapter offers significantly *broken* insight
> for our purposes.  What they found was a horrible mess, then

Hmm, a minor mess, but not that horrible.

> they introduced wannabe-unambiguous LS + PS, and what they
> arrived at was messier than before.  Claiming that CRLF is 

LS and PS were introduced in Unicode 1.1, long before the text
for section 5.8 was drafted (originally as UTR #13).

One important point that you have missed is that LS and PS,
and the difference between THEM, are essential to the bidi
algorithm. What is or may be done with other NLFs is basically
a hack (most NLFs are treated as if they were PS).

Note also that two LSes in sequence don't make a PS...

Assume for the moment that we were using only LS (as John
has suggested as a possible ideal). The bidi algorithm would
then treat the ENTIRE document, however many thousands of
pages it may span, as a SINGLE paragraph. Thus, if there is
any bidi processing, none of the text can be displayed until
the entire text has been read in and bidi levelled as a whole.
It may also have display effects: the bidi controls (LRE, RLE,
LRO, RLO) extend at most to the end of a paragraph, so if
paragraph ends are replaced by LS, the controls may span more
text than they did originally.

The hack is to regard all NLFs except LS, VT, and FF as PS.
Since CRLF (say) may occur inside what is actually a paragraph,
this also has display effects (limiting the range of the bidi
controls more than was originally the case), but at least bidi
processing can be done piece by piece.
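
To illustrate the piece-by-piece point, here is a minimal Python
sketch. It assumes that the standard unicodedata module reports the
Bidi_Class values of the Unicode Character Database, and it only
splits text at class B (paragraph separator) characters, keeping
CR LF together as one separator; it is not an implementation of the
bidi algorithm. The function name bidi_paragraphs and the sample
strings are made up for the example.

```python
import unicodedata


def bidi_paragraphs(text):
    """Split text at characters whose bidi class is B.

    A CR immediately followed by LF is kept together as one separator.
    Each returned piece includes its trailing separator, if any.
    """
    paragraphs = []
    start = 0
    i = 0
    while i < len(text):
        if unicodedata.bidirectional(text[i]) == "B":
            end = i + 1
            # Treat CR LF as a single paragraph break.
            if text[i] == "\r" and end < len(text) and text[end] == "\n":
                end += 1
            paragraphs.append(text[start:end])
            start = end
            i = end
        else:
            i += 1
    # Any text after the last separator forms a final paragraph.
    if start < len(text):
        paragraphs.append(text[start:])
    return paragraphs


LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

crlf_text = "one\r\ntwo\r\nthree\r\n"
ls_text = crlf_text.replace("\r\n", LS)  # the "LS only" scenario
ps_text = crlf_text.replace("\r\n", PS)

for label, sample in (("CRLF", crlf_text), ("LS", ls_text), ("PS", ps_text)):
    print(label, len(bidi_paragraphs(sample)))
```

With these samples the CRLF and PS versions split into three pieces
each, while the LS-only version remains a single paragraph, however
long it gets.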

Bidi control codes are not talked about in the document
we are discussing...

> "windows" is odd for DOS + OS/2 users, it is also at odds
> with numerous Internet standards - precisely the reason why
> we need your draft.  
> 
> The chapter talks about line and paragraph separators without
> mentioning relevant ASCII controls such as RS.  On the other

RS (and GS and FS) are regarded the same as PS for bidi
processing, even though they are not mentioned in section 5.8.
But I would agree that using RS, GS, FS, or (even worse) IND
would be aberrant.

US is treated like an HT for bidi processing (as HTJ should be,
but isn't by default). Note that HT is NOT treated the same as a
sequence of spaces for bidi processing. HT always has the
paragraph bidi level, which is not necessarily the case for spaces.
This DOES affect display, in that HT always "moves" according to
the paragraph level, while spaces may (and often do) move opposite
to the paragraph level. So **DON'T** imply that HT should be
replaced by spaces; such a replacement WILL have ill display
effects.

However, I do think that these four characters should not be
used.
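
The classes mentioned above can be checked with a short Python
sketch, again assuming that the unicodedata module mirrors the UCD
Bidi_Class property (S = segment separator, WS = whitespace,
B = paragraph separator, BN = boundary neutral). The dictionary
names CHARS and bidi_class are only for illustration.

```python
import unicodedata

# Characters discussed above, keyed by their usual abbreviations.
CHARS = {
    "HT": "\u0009",
    "SP": "\u0020",
    "US": "\u001F",
    "HTJ": "\u0089",
    "RS": "\u001E",
    "GS": "\u001D",
    "FS": "\u001C",
    "PS": "\u2029",
}

# Look up the bidi class of each character.
bidi_class = {name: unicodedata.bidirectional(ch) for name, ch in CHARS.items()}

for name, cls in bidi_class.items():
    print(f"{name:4} {cls}")
```

HT and US come out as S and SP as WS, which is why an HT cannot
simply be swapped for spaces; RS, GS, and FS share class B with PS,
and HTJ is BN rather than S.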

> hand it mentions MS Word interna which are nobody's business
> outside of MS Word.

I guess you are referring to VT (LINE TABULATION). The real
reason it is mentioned is that the C and C++ standards give a
special escape for it (\v, which according to C implies a return
to the beginning of the line). If it were not for that, I would
agree that VT is not very interesting (though it does provide a
hack for distinguishing line separation from paragraph separation
by ignoring the "tabulation" aspect of VT, even in pure 8-bit
character encodings).

	/Kent Karlsson

> It is interesting, but IMO unusable for net-utf8.