> The IETF mandates the use of UTF-8 for text [RFC2277] as part of
> internationalisation. When writing an RFC, this raises a number of issues.
>
> A) Character set. UTF-8 implicitly specifies the use of Unicode/IS10646 which
> contains 97,000 - and rising - characters. Some (proposed) standards limit
> themselves to 0000..007F, which is not at all international, others to
> 0000-00FF, essentially Latin-1, which suits many Western languages but is not
> truly international. Is 97,000 really appropriate or should there be a defined
> subset?

Short answer: No.

Longer answer: Subsetting makes sense in some places, such as when defining
canonicalization schemes like our various *prep profiles. You pretty much have
to limit yourself to what's currently defined for these, and Unicode is, for
better or worse, open-ended.

However, limiting everything to a subset of Unicode simply because "97,000
characters are too many" is a terrible idea. This has been tried - take a look
at the various string types for Unicode in ASN.1 or the text body part types
in X.400 - and the result has been nothing but a mess.

> B) Code point. Many standards are defined in ABNF [RFC4234] which allows code
> points to be specified as, eg, %b00010011 %d13 or %x0D none of which are
> terribly Unicode-like (U+000D). The result is standards that use one notation
> in the ABNF and a different one in the body of the document; should ABNF allow
> something closer to Unicode (as XML has done with &#xD;)?

ABNF is charset-independent, mapping onto non-negative integers, not
characters. Nothing prevents a specification from saying that a given ABNF
grammar specifies a series of Unicode characters represented in UTF-8 and
using %xFEFF or whatever in the grammar itself.

> C) Length. Text is often variable in length so the length must be determined.
> This may be implicit from the underlying protocol or explicit as in a TLV. The
> latter is troublesome if the protocol passes through an application gateway
> which wants to normalise the encoding so as to improve security and wants to
> convert UTF to its shortest form with corresponding length changes

The various length issues and tradeoffs that exist in the different Character
Encoding Schemes for Unicode are well known, have been extensively debated,
and are well understood. There are inherent tensions between the various
formats (preserve ASCII vs. equal length for all characters vs. wasting
space), and this makes any choice a compromise.

> (Unicode lacks a no-op, a meaningless octet, one that could be added or
> removed without causing any change to the meaning of the text).

NBSP is used for this purpose.

> Other protocols use a terminating sequence. NUL is widely used in *ix; some
> protocols specify that NUL must terminate the text, some specify that it must
> not, one at least specifies that embedded NUL means that text after a NUL must
> not be displayed (interesting for security). Since UTF-8 encompasses so much,
> there is no natural terminating sequence.

This simply isn't true. NUL is present in Unicode and is commonly used as a
terminator.

> D) Transparency. An issue linked to C), protocols may have reserved characters,
> used to parse the data, which must not then appear in text. Some protocols
> prohibit these characters (or at least the single octet encoding of them),
> others have a transfer syntax, such as base64, quoted-printable, %xx or an
> escape character ( " \ %). We could do with a standard syntax.

I disagree. Different environments have very different constraints, and in
many cases these constraints interact with the underlying character data in
ways that force the use of different escaping conventions. For example, the
differences between the quoted-printable content-transfer-encoding and the Q
encoding of encoded-words are NOT gratuitous.

> E) Accessibility. The character encoding is specified in UTF-8 [RFC3629] which
> is readily accessible (of course:-) but to use it properly needs reference to
> IS10646, which is not. I would like to check the correct name of eg
> hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and in the absence of IS10646
> am unable to do so.

The entire Unicode character database is readily available online:

  http://www.unicode.org/ucd/

A quick check shows that 0x002D is written as HYPHEN-MINUS in the database and
in the code charts. I didn't need (and never have needed) a copy of ISO 10646
to find out stuff like this. I do find a copy of the printed Unicode book
useful, though not required (I have versions 1 through 3); it is readily
available, albeit not free.

The reason many of our standards documents refer to ISO 10646 is that at one
time there was concern that Unicode wasn't sufficiently stable, and it was
felt that reference to the ISO document would offer some protection against
capricious change. I think in retrospect this concern has been shown to be
unwarranted, and all things being equal I would prefer to see references to
the more readily available Unicode materials. (Given the wide deployment of
Unicode now there is effectively no chance of a major change along the lines
of the Hangul reshuffle between V1 and V2.)
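For what it's worth, both the length shifts mentioned under C) and the name
lookup under E) can be seen in a few lines of Python. A minimal sketch,
assuming a reasonably current Python 3 whose bundled unicodedata module
tracks the Unicode character database:

    import unicodedata

    # E) The formal name of U+002D, straight from the character database
    #    that ships with the interpreter - no copy of ISO 10646 needed.
    print(unicodedata.name("\u002D"))          # HYPHEN-MINUS
    print(unicodedata.lookup("HYPHEN-MINUS"))  # -

    # C) Normalization can change both the code point count and the UTF-8
    #    octet count, which is why a gateway that normalizes text carried
    #    in a TLV has to be prepared to rewrite the length field.
    decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    # decomposed: 2 code points, 3 UTF-8 octets; composed: 1 code point, 2 octets
    print(len(decomposed), len(decomposed.encode("utf-8")))
    print(len(composed), len(composed.encode("utf-8")))

    # C) Shortest-form enforcement: a conforming UTF-8 decoder rejects the
    #    overlong two-octet encoding of NUL instead of mapping it to U+0000.
    try:
        b"\xC0\x80".decode("utf-8")
    except UnicodeDecodeError as err:
        print("overlong encoding rejected:", err)

None of this needs anything beyond the data that ships with the interpreter.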
> Overall, my perception is that we have the political statement - UTF-8 will be
> used - but have not yet worked out all the engineering ramifications.

Well, we have a lot more than a political statement - a huge amount of
engineering work has been done to make Unicode workable in IETF protocols and
elsewhere. Does more work need to be done? Of course it does - these tasks are
by their very nature pretty much unending. But all of the points you have
raised here are either nonissues, settled issues, or engineering compromises
where there is no "right" answer.

				Ned