RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

"Kent Karlsson" <kent.karlsson14@xxxxxxxxx> · Mon, 7 Jan 2008 22:30:51 +0100

Comment on draft-klensin-net-utf8-07.txt:

--------------------------

"Network Virtual Terminal (NVT)" occurs first in Appendix A.
The explanation of the abbreviation should (also) be given at
the first occurrence of "NVT" in the document.

--------------------------

Section 2, point 2, "Line-endings..."

       "discussion.  The newer control characters IND (U+0084) and NEL
       ("Next Line", U+0085) might have been used to disambiguate the"

I have a hard time figuring out what IND was supposed to be used for,
but I don't think it was for line endings. Chain printer "font" change is
the closest I get... (http://www.freepatentsonline.com/3699884.html).

NEL is used in EBCDIC originally (IIUC), and still used in EBCDIC...

The description "might have been used to disambiguate" is more
appropriate for U+2028 and U+2029.

--------------------------

       "it, lines end in CRLF and only in CRLF.  Anything that does not
       end in CRLF is either not a line or is severely malformed."

The sentence starting with "Anything" seems severely malformed...
You don't really mean to say "Anything", I hope. "Using other line
ending or line separation conventions" perhaps. And "severely
malformed", I hope you did not mean that either. "is lacking in
conversion to 'net-utf8'/'net-Unicode'" perhaps.

The principle of being "restrictive in what one emits and permissive/liberal
in what one receives" might be applicable here.

Upon receipt, the following SHOULD be seen as at least line ending
(or line separating), and in some cases more than that:

LF, CR+LF, VT, CR+VT, FF, CR+FF, CR (not followed by NUL...),
NEL, CR+NEL, LS, PS
where
LF	U+000A
VT	U+000B
FF	U+000C
CR	U+000D
NEL	U+0085
LS	U+2028
PS	U+2029

even FS, GS, RS
where
FS	U+001C
GS	U+001D
RS	U+001E
should be seen as line separating (Unicode specifies these as having bidi
property B, which effectively means they are paragraph separating).

Apart from CR+LF, these SHOULD NOT be emitted for net-utf8, unless
that is overridden by the protocol specification (like allowing FF, or CR+FF).
When faced with any of these in input **to be emitted as net-utf8**, each
of these SHOULD be converted to a CR+LF (unless that is overridden
by the protocol in question).
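
As a rough sketch of that conversion step (Python; the name to_net_crlf is
my own and not from the draft, and treating FS, GS and RS the same way as
the others is an assumption based on the suggestion above):

```python
import re

# Line-ending sequences from the lists above. The two-character CR+X
# forms come first so that they are replaced as one unit, not as two.
_LINE_END = re.compile(
    r"\r[\n\x0b\x0c\x85]"      # CR+LF, CR+VT, CR+FF, CR+NEL
    r"|\r(?!\x00)"             # lone CR, but leave CR+NUL alone
    r"|[\n\x0b\x0c\x85\u2028\u2029\x1c\x1d\x1e]"  # LF VT FF NEL LS PS FS GS RS
)


def to_net_crlf(text: str) -> str:
    """Replace every recognised line-ending sequence with CR+LF."""
    return _LINE_END.sub("\r\n", text)
```

A protocol that allows FF or CR+FF would simply drop \x0c from both
character classes before applying the substitution.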

--------------------------

Section 2, point 3:

You have made an exception for FF (because they occur in RFCs?).
I think FF SHOULD be avoided, just like VT, NEL, and more (see above).
Even when it is allowed, it, and CR+FF, should be seen as line separating.

You have also (by implication) dismissed HT, U+0009. The reason for this is
unclear, especially since HT is so common in plain texts (often with some
default tab setting). Mapping HT to SPs is often a bad idea. I don't think a
default tab setting should be specified, but the effect of somewhat (not wildly)
different defaults for that is not much worse than using variable width fonts.

SP, U+0020, is nowadays not seen as a control character, not even in
your own text... (same paragraph).

--------------------------

   "However, because they were optional in NVT applications
   and this specification is an NVT superset, they cannot be prohibited
   entirely." 

Why not? Why must this be a strict NVT superset? I think it would be rather
important to rule these strange beasts out from net-utf8. These were really
ASCII (ISO 646) features, but were ruled out long before Unicode.

--------------------------

   "The most important of these rules is that CR MUST NOT
   appear unless it is immediately followed by LF (indicating end of
   line) or NUL."

I don't see how that follows (read: that does not follow).
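
For reference, the quoted rule taken literally amounts to the check below.
This is only a sketch in Python; stray_cr_offsets is a hypothetical name of
mine, not something from the draft.

```python
import re

# A CR that is not immediately followed by LF or NUL. A CR at the very
# end of the text also counts, since nothing follows it.
_STRAY_CR = re.compile(r"\r(?![\n\x00])")


def stray_cr_offsets(text: str) -> list[int]:
    """Return the offsets of every CR that breaks the quoted rule."""
    return [m.start() for m in _STRAY_CR.finditer(text)]
```

Text using a bare CR, CR+VT, CR+FF or CR+NEL would be flagged by this check,
even though a liberal receiver could still treat those as line endings.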

--------------------------

     "[ISO10646]
              International Organization for Standardization,
              "Information Technology - Universal Multiple- Octet Coded
              Character Set (UCS) - Part 1: Architecture and Basic
              Multilingual Plane"", ISO/IEC 10646-1:2000, October 2000."

That seems a bit old... Better with the current revision:

ISO/IEC 10646:2003   Information technology -- Universal Multiple-Octet Coded
Character Set (UCS)

with the amendments (which I don't think you should reference explicitly):
ISO/IEC 10646:2003/Amd 1:2005  Glagolitic, Coptic, Georgian and other characters
ISO/IEC 10646:2003/Amd 2:2006  N'Ko, Phags-pa, Phoenician and other characters
(and more amendments in the works).

--------------------------
