Re: Last Call: draft-klensin-unicode-escapes (ASCII Escaping of Unicode Characters) to BCP

Peter Constable <petercon@xxxxxxxxxxxxx> · Sun, 21 Oct 2007 21:57:37 -0700

I have a terminological objection to this draft, mainly in section 2. I also have some other comments on section 2, which I give further below.

First, terminology: the heading for section 2 has "...Table Position...", and the body refers to "code point position in the table". While the term "code table" could have been used in the Unicode Standard to refer to the encoded entities and their encoding, it is not.

The Unicode Standard uses these terms:

- It uses "character set" and "character repertoire" for the collection of elements being encoded, and "coded character set" for the set of pairs of such elements and their encoded representations.

- It uses "codespace" to refer to a range of numeric values used as encoded representations, and specifically "Unicode codespace" for the range 0 to 10FFFF (hex).

- It uses "code point" or "code position" (synonyms) for values in the Unicode codespace.

Thus, the appropriate term here is simply "code point" or "code position". "Table position" and "position in the table" are not appropriate since the Standard never uses "table" in this regard. And "code point position" is redundant. Perhaps the wording was attempting to differentiate between code points and various encoded representations of code points. But the latter are not code points per se, so there isn't really any ambiguity.

A possible refinement might be to use "Unicode Scalar Value": this refers to code points other than surrogate code points. By definition in the Standard, encoded characters can only be assigned to a Unicode Scalar Value. I don't see this as a necessary change in the draft, however.
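To make the distinction between these terms concrete, here is a minimal Python sketch that treats them as integer ranges. The function and constant names are illustrative only. The surrogate range D800 to DFFF (hex) is taken from the Unicode Standard; it is not stated in the text above.

```python
# Sketch of the terms above as integer ranges (names are illustrative).
CODESPACE_MAX = 0x10FFFF   # the Unicode codespace is 0 to 10FFFF (hex)
SURROGATE_FIRST = 0xD800   # first surrogate code point (from the Unicode Standard)
SURROGATE_LAST = 0xDFFF    # last surrogate code point (from the Unicode Standard)


def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode codespace."""
    return 0 <= value <= CODESPACE_MAX


def is_scalar_value(value: int) -> bool:
    """True if value is a code point other than a surrogate code point."""
    return is_code_point(value) and not (SURROGATE_FIRST <= value <= SURROGATE_LAST)
```

Under these definitions D800 (hex) is a code point but not a Unicode Scalar Value, while 110000 (hex) is neither.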

Now for other comments on section 2.

The draft has:

  "However, when
   information about characters is to be processed by people,
   information about the Unicode code point is preferable to a further
   encoding of the encoded form of the character."

Information about the code point? (The code point of that character is numeric / is an integer / is non-negative / is in the range 0 to 10FFFF / is even / is divisible by 17 / is the same value as the number of days the song "Hey Jude" was on the Top 40 list.) I think it is the code point itself that is to be preferred, not information about it.

Also, "a further encoding of the encoded form" isn't going to be clear to readers. (I'm not sure myself what these words mean; I can guess at what the author meant, but I am not positive.)

Thus, I'd change this text to:

  "However, when
   information about characters is to be processed by people,
   reference to the Unicode code point is preferable to encoded
   representations of the code point."

Now, section 2 is talking about alternate representations of an encoded character, but the flow is a bit mixed up, IMO. The first paragraph says that there are different equivalent representations but that the Unicode code point is preferred. Then the next paragraph revisits the same thing in more detail. The sentence from the first paragraph discussed above, once revised so that it makes a clear statement, already says what paragraph two says in greater detail. Whether a more succinct or more detailed statement is preferred, just say it once.

Of course, if the more detailed paragraph two is kept, "code point position in the table" should be changed to "code point".

Also from paragraph two:

   "the UTF-8
   encoding or some other short-form encoding"

The term "short-form encoding" isn't explained here and may not be understood. I can only guess what is meant. If the intended meaning is what I think (a reference to shortest-form versus non-shortest-form UTF-8), then I don't think it's really relevant. Either way, I'd change the wording to:

   "the UTF-8 encoding or some other encoding form"

(Encoding form is a term defined in the Unicode Standard.)
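As an illustration of what "encoding form" covers, and of the shortest-form question mentioned above, here is a small Python sketch. The helper names are hypothetical, and the sketch relies on Python's built-in codecs rather than on anything in the draft. It lists the code units of one character in UTF-8, UTF-16 and UTF-32, and checks whether an octet sequence is well-formed UTF-8.

```python
def code_units(ch: str) -> dict[str, list[str]]:
    """Return the code units of one character in each encoding form, as hex."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian so the code units read naturally
    utf32 = ch.encode("utf-32-be")
    return {
        "UTF-8": [f"{octet:02X}" for octet in utf8],
        "UTF-16": [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)],
        "UTF-32": [utf32[i:i + 4].hex().upper() for i in range(0, len(utf32), 4)],
    }


def is_well_formed_utf8(octets: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octet sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For U+00E9 this gives C3 A9 in UTF-8, 00E9 in UTF-16 and 000000E9 in UTF-32. The two-octet sequence C0 AF, a non-shortest-form encoding of U+002F, is rejected by the decoder, whereas the single octet 2F is accepted.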

Also:

   "the other encodes the octets of"

I don't think octets are encoded; they are simply referenced using some notational system. Thus, change to:

   "the other uses the octets of ... in some representation."

(This gives parallel wording for the two kinds of reference.)
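The two kinds of reference can be sketched in Python as follows. The function names and output formats are my own illustration: the "U+" notation is the usual Unicode convention, and neither function is meant to reproduce the escape syntax the draft recommends.

```python
def code_point_reference(ch: str) -> str:
    """Refer to a character by its Unicode code point, e.g. 'U+00E9'."""
    return f"U+{ord(ch):04X}"


def utf8_octet_reference(ch: str) -> str:
    """Refer to a character through the octets of its UTF-8 form, in hex."""
    return " ".join(f"{octet:02X}" for octet in ch.encode("utf-8"))
```

For the same character the first yields "U+00E9" and the second yields "C3 A9". The first identifies the code point directly; the second is a notation for the octets of one particular encoding form.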

Finally:

   "the Unicode code point forms"

Drop "forms":

   "the Unicode code points"

Peter Constable
