RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

John C Klensin <john-ietf@xxxxxxx> · Sun, 10 Feb 2008 20:35:34 -0500

--On Monday, 07 January, 2008 22:30 +0100 Kent Karlsson
<kent.karlsson14@xxxxxxxxx> wrote:

> Comment on draft-klensin-net-utf8-07.txt:
> 
> --------------------------
> 
> "Network Virtual Terminal (NVT)" occurs first in Appendix A.
> The explanation of the abbreviation should (also) be given at
> the first occurrence of "NVT" in the document.

Fixed in -09

> --------------------------
> 
> Section 2, point 2, "Line-endings..."
> 
>        "discussion.  The newer control characters IND (U+0084)
>        and NEL ("Next Line", U+0085) might have been used to
>        disambiguate the"
> 
> I have a hard time figuring out what IND was supposed to be
> used for, but I don't think it was for line endings. Chain
> printer "font" change is the closest I get...
> (http://www.freepatentsonline.com/3699884.html).

As far as I can tell, and based on the comments that came from
those who suggested that I make that addition, it is an index
(same position on next line) function.

> NEL is used in EBCDIC originally (IIUC), and still used in
> EBCDIC...

This is just notation.   Whether the functions are the same may
or may not be relevant.

> The description "might have been used to disambiguate" is more
> appropriate for U+2028 and U+2029.

That is why the next sentence says "Similar observations
apply...".  These things represent, as far as I can tell,
iterative attempts to get things right.

> --------------------------
> 
>        "it, lines end in CRLF and only in CRLF.  Anything that
>        does not end in CRLF is either not a line or is
>        severely malformed."
> 
> The sentence starting with "Anything" seems severely
> malformed... You don't really mean to say "Anything", I hope.
> "Using other line ending or line separation conventions"
> perhaps. And "severely malformed", I hope you did not mean
> that either. "is lacking in conversion to
> 'net-utf8'/'net-Unicode'" perhaps.

Sentence has been rewritten into a conformance statement.
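
As a rough illustration only (this is not text from the draft), a
"CRLF and only CRLF" rule can be checked mechanically.  The sketch
below is in Python; the function names are made up for this example,
and it assumes that every CR or LF outside a CRLF pair counts as a
violation.  It does not give CR NUL or any other sequence special
treatment.

```python
import re

# Matches a CR that is not followed by LF, or an LF that is not preceded by CR.
_BARE_CR_OR_LF = re.compile(r"\r(?!\n)|(?<!\r)\n")


def bare_line_controls(text: str) -> list[int]:
    """Return the offsets of CR or LF characters that are not part of a CRLF pair."""
    return [m.start() for m in _BARE_CR_OR_LF.finditer(text)]


def uses_only_crlf(text: str) -> bool:
    """True if every CR and LF in the text belongs to a CRLF pair."""
    return not bare_line_controls(text)
```

A caller could reject or flag text for which uses_only_crlf()
returns False, and use the offsets to report where the stray
characters are.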

> To be "restrictive in what one emits and permissive/liberal in
> what one receives" might be applicable here.
> 
> Upon receipt, the following SHOULD be seen as at least line
> ending (or line separating), and in some cases more than that:
> 
> LF, CR+LF, VT, CR+VT, FF, CR+FF, CR (not followed by NUL...),
> NEL, CR+NEL, LS, PS
> where
> LF	U+000A
> VT	U+000B
> FF	U+000C
>...

The reasons why the robustness principle should not be applied
as you are trying to apply it are an interesting philosophical
discussion that does not, IMO, help here.  The bottom line is
that this is a spec for a single standard format, not a whole
series of variations that senders have the right to assume that
receivers will support.
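
For concreteness, the receive-side leniency proposed in the quoted
list would amount to something like the following Python sketch.  It
illustrates the proposal, not anything the draft requires.  The
function name is hypothetical, the code points for LS and PS are
taken to be U+2028 and U+2029, and a CR followed by NUL is simply
left alone because the quoted list does not say more about that case.

```python
import re

# Sequences from the quoted list, longest alternatives first:
#   CR+LF, CR+VT, CR+FF, CR+NEL; then LF, VT, FF, NEL alone;
#   then CR not followed by NUL; then LS and PS.
_LENIENT_LINE_END = re.compile(
    r"\r[\n\x0b\x0c\x85]"
    r"|[\n\x0b\x0c\x85]"
    r"|\r(?!\x00)"
    r"|[\u2028\u2029]"
)


def normalize_line_endings(text: str) -> str:
    """Rewrite every sequence in the lenient list as CRLF.

    This discards the distinctions the list hints at ("in some cases
    more than that"), e.g. a form feed or a paragraph separator comes
    out as a plain line ending.
    """
    return _LENIENT_LINE_END.sub("\r\n", text)
```

Text that already uses only CRLF passes through unchanged; everything
else is guessed at by the receiver.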

I've elided comments below that seem to be just different ways
to pursue the theme of "why don't we support every character
that might imaginably be a line-ending as if it were one".

> --------------------------
> 
> Section 2, point 3:
> 
> You have made an exception for FF (because they occur in
> RFCs?). I think FF SHOULD be avoided, just like VT, NEL, and
> more (see above). Even when it is allowed, it, and CR+FF,
> should be seen as line separating.

No. See above.  The question of what characters should be on
that list has been discussed endlessly and the text has been
changed repeatedly to explain why various proposals were not
adopted.  If this work is to be completed, we need to stop
somewhere.

> You have also (by implication) dismissed HT, U+0009. The
> reason for this is unclear. Especially since HT is so common
> in plain texts (often with some default tab setting). Mapping
> HT to SPs is often a bad idea. I don't think a default tab
> setting should be specified, but the effect of somewhat (not
> wildly) different defaults for that is not much worse than
> using variable width fonts.

An explanation appears in -08.

> SP, U+0020, is nowadays not seen as a control character, not
> even in your own text... (same paragraph).
> 
> 
> --------------------------
> 
>    "However, because they were optional in NVT applications
>    and this specification is an NVT superset, they cannot be
>    prohibited entirely."
> 
> Why not? Why must this be a strict NVT superset? I think it
> would be rather important to rule these strange beasts out
> from net-utf8. These were really ASCII (ISO 646) features, but
> have been ruled out much before Unicode.

But you have argued that some of them should be treated as line
separators, and any system that supports VT100 controls (i.e.,
U**x or almost any of its children) still requires them.

> --------------------------

>      "[ISO10646]
>               International Organization for Standardization,
>               "Information Technology - Universal Multiple-Octet
>               Coded Character Set (UCS) - Part 1: Architecture
>               and Basic Multilingual Plane"", ISO/IEC
>               10646-1:2000, October 2000."
> 
> That seems a bit old... Better with the current revision:
> 
> ISO/IEC 10646:2003   Information technology -- Universal
> Multiple-Octet Coded Character Set (UCS)
> 
> with the amendments (which I don't think you should reference
> explicitly):
> ISO/IEC 10646:2003/Amd 1:2005  Glagolitic, Coptic, Georgian
> and other characters
> ISO/IEC 10646:2003/Amd 2:2006  N'Ko, Phags-pa, Phoenician and
> other characters
> (and more amendments in the works).

Changed in -09.   I hope you like the new form better.

      john
