RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

John C Klensin wrote:

--Frank Ellermann wrote:
> >...
> > Hopefully somebody can confirm that IND is correct, or not.
> > For HT and FF I hope the final version will somehow express
> > that both are not really bad, and as far as they're bad FF is
> > worse than HT. 

See http://www.itscj.ipsj.or.jp/ISO-IR/077.pdf, which, somewhat
to my surprise, says that IND is an LF clone. However, IND has
long been deprecated, and never got any noticeable use, and is even
REMOVED from ECMA 48. So I think it is safe to ignore IND. Indeed,
I would prefer it not be mentioned in the document we're discussing.

(I would like to say the same about NEL, but NEL is alive
and is the native line separator/terminator in EBCDIC-based systems,
and may escape as NEL rather than be converted to something else.)

> I'm open to consensus about changes for either HT or FF, but the
> theory of "bad" that was used to construct the spec was:
> 
> (i) If a "spacing" control has the effect of setting the
> position of the next character, it is "bad" unless that position
> is unambiguous.   In addition, things are "bad" unless they are
> necessary in running text (as distinct from faking things that
> are better handled in markup, followed by either device-specific
> output or standard page representations, neither of which are
> normal text).

There is also another issue. If HT is converted to (presumably)
a sequence of SP, you will mess up bidi text. (See one of the
other mails I sent at about the same time as this one.)
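
To illustrate the HT point: HT and SP have different Bidi_Class values in
the Unicode Character Database (HT is a segment separator, S, while SP is
plain whitespace, WS), so replacing one with the other changes the input
to the bidi algorithm. A minimal sketch in Python, using only the standard
unicodedata module; the helper names are hypothetical and the tab width of
8 is an assumption, not something the draft specifies:

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

def expand_tabs_naively(text, width=8):
    """Replace each HT with SPs up to the next tab stop (assumed width)."""
    return text.expandtabs(width)

sample = "abc\tdef"
print(bidi_classes("\t "))                           # HT, then SP
print("S" in bidi_classes(sample))                   # segment separator present
print("S" in bidi_classes(expand_tabs_naively(sample)))  # lost after expansion
```

After the expansion no segment separator is left in the string, so a bidi
implementation no longer sees the boundary that HT provided.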

> It is unambiguous for SP.  It is unambiguous for CRLF.
> Independent of the "what is a line-end" problem, it is somewhat
> ambiguous for CR or LF alone and for IND.  It is ambiguous for

Even though IND was, for some strange reason, defined as an LF
clone, it has long been deprecated, and AFAIK never saw any
popular use. I think it is best left forgotten and left in silence.
Note also that it is not only deprecated, but even REMOVED from
ISO/IEC 6429 (ECMA 48).

> HT.  It would be ambiguous for FF except that FF is assigned
> fairly clear semantics in NVT -- "FF" is not a line ending

Of course it is a line ending. So is "raw" LF. That the new line
(under some circumstances) may be strangely indented is irrelevant.

> (CRLF FF is needed)

That is a combination I haven't heard of before and I DON'T
think it should be regarded as one NLF. There are TWO NLFs there,
CRLF and then FF.
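
As a sketch of what "two NLFs" means in practice, the following Python
scans a string and reports each line-ending function separately, treating
CRLF as a single unit. The set of characters used (CRLF, CR, LF, NEL, FF)
is an assumption that follows the reading in this mail, where FF also ends
a line; it is not taken from the draft, and find_nlfs is a hypothetical
name:

```python
import re

# Assumed set of line-ending functions: CRLF (as one unit), CR, LF, NEL, FF.
# CRLF is listed first so that it is never split into CR followed by LF.
NLF = re.compile("\r\n|[\r\n\u0085\u000c]")

def find_nlfs(text):
    """Return every line-ending function found in text, in order."""
    return [m.group(0) for m in NLF.finditer(text)]

found = find_nlfs("line one\r\n\x0cline two")
print(len(found))   # number of separate line-ending functions
print(found)
```

With this reading, CRLF followed by FF is reported as two matches, not as
one combined line end.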

> and as Bob Braden noted, there is a fairly clear
> rule that FF is to be interpreted as "top of next page" if one

Sure. But the line before it is also ended (no matter where the
top of next page line begins).

> knows what a page is and as "blank line" otherwise.  But that
> rule is sufficiently often ignored to call for considerable
> caution about FF, and the text now contains a cautionary note
> for that reason.

I agree that there should be caution, but not in the shape and
form it has in the draft we are discussing.

> There is an interesting demonstration of the law of unintended
> consequences here.  If we could tell that a string was
> unambiguously UTF-8 (or whatever) by looking at it, even if it
> contains nothing but ASCII characters, then there would be no
> reason to try to make net-utf8 a proper superset of NVT.  If we

I don't see why you really need to carry on the (unworkable in
a more general setting than ASCII, in particular it is unworkable
for the UCS) idea of using carriage return and BS for strange
overstriking. Even for ASCII, the ONLY aspect of that that worked
moderately well was using <BS, _> (or similar) for underlining.
But note also that underlining can be achieved in the UCS
(without using kludges like <BS, _>) without the use
of a higher level protocol, by instead using U+0332, COMBINING LOW
LINE. Though using a higher level protocol for getting underlining
is preferable (consider searching), COMBINING LOW LINE would
still be much preferable to <BS, _> (or similar).
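
For what it is worth, converting the old overstrike convention is simple.
A minimal sketch in Python, assuming the order "character, BS, underscore"
(the reverse order, underscore first, also existed and is not handled
here); the function name is hypothetical:

```python
import re

BS = "\x08"
COMBINING_LOW_LINE = "\u0332"

# One character (not BS or "_"), followed by BS, followed by "_".
OVERSTRIKE_UNDERLINE = re.compile("([^\x08_])\x08_")

def overstrike_to_combining(text):
    """Rewrite <char, BS, _> as <char, U+0332 COMBINING LOW LINE>."""
    return OVERSTRIKE_UNDERLINE.sub(
        lambda m: m.group(1) + COMBINING_LOW_LINE, text)

old_style = "a" + BS + "_" + "b" + BS + "_" + "c"
new_style = overstrike_to_combining(old_style)
print(BS in new_style)                                   # False
print(new_style == "a\u0332b\u0332c")                    # True
```

The result contains no BS at all, and a search for "ab" can still be made
to work by ignoring combining marks, which is not true of the BS form.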

> could do that, we could also do away with the entire "next line"
> debate by prohibiting even CRLF and requiring the use of LS

LS would be a bad idea. See my other email (sent at approx. the
same time as this one). You would get (to you) unexpected effects
from bidi processing.
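
The difference can be seen directly in the character properties: CR, LF
and NEL have Bidi_Class B (paragraph separator), while LS has Bidi_Class
WS, so presumably LS does not end a paragraph for the purposes of the bidi
algorithm. A small Python check using the standard unicodedata module (the
dictionary of names is only for illustration):

```python
import unicodedata

# Candidate line-end characters and their Bidi_Class values.
candidates = {
    "CR": "\r",
    "LF": "\n",
    "NEL": "\u0085",
    "LS": "\u2028",
}

bidi_class = {name: unicodedata.bidirectional(ch)
              for name, ch in candidates.items()}

for name, cls in bidi_class.items():
    print(name, cls)
```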

		/Kent Karlsson


> (U+2028).  In retrospect, there might have been considerable
> advantages to forcing the ASCII- UTF-8 distinction by requiring
> that UTF-8 strings all start with a BOM, but it is far too late
> for that (and probably not, on balance, a good idea despite its
> advantages).  So I don't see how to get there from here -- we
> are stuck, for historical reasons, with CRLF on the wire as what
> The Unicode Standard calls NLF (incidentally, Unicode 5.0,
> Section 5.8, provides significant insight into the complexity of
> this problem and probably should have been referenced.  It would
> be even more helpful had Table 5-2 identified CRLF as
> a standard Internet "wire" form of NLF, not just bound that
> form to Windows.)