The IETF mandates the use of UTF-8 for text [RFC2277] as part of internationalisation. When writing an RFC, this raises a number of issues.

A) Character set. UTF-8 implicitly specifies the use of Unicode/ISO 10646, which contains 97,000 - and rising - characters. Some (proposed) standards limit themselves to 0000..007F, which is not at all international; others to 0000..00FF, essentially Latin-1, which suits many Western languages but is not truly international. Is 97,000 really appropriate, or should there be a defined subset?

B) Code point. Many standards are defined in ABNF [RFC4234], which allows code points to be specified as, e.g., %b00001101, %d13 or %x0D, none of which is terribly Unicode-like (U+000D). The result is standards that use one notation in the ABNF and a different one in the body of the document; should ABNF allow something closer to Unicode (as XML has done with &#x0D;)?

C) Length. Text is often variable in length, so the length must be determined. This may be implicit from the underlying protocol or explicit, as in a TLV. The latter is troublesome if the protocol passes through an application gateway which, to improve security, normalises the encoding by converting UTF-8 to its shortest form, with corresponding length changes. (Unicode lacks a no-op, a meaningless octet that could be added or removed without changing the meaning of the text, so the original length cannot be preserved by padding.) Other protocols use a terminating sequence. NUL is widely used in *ix; some protocols specify that NUL must terminate the text, some specify that it must not, and at least one specifies that text after an embedded NUL must not be displayed (interesting for security). Since UTF-8 encompasses so much, there is no natural terminating sequence.

D) Transparency. An issue linked to C): protocols may have reserved characters, used to parse the data, which must not then appear in text.
Some protocols prohibit these characters (or at least the single-octet encoding of them); others have a transfer syntax, such as base64, quoted-printable, %xx or an escape character (" \ %). We could do with a standard syntax.

E) Accessibility. The character encoding is specified in UTF-8 [RFC3629], which is readily accessible (of course :-) but using it properly needs reference to ISO 10646, which is not. I would like to check the correct name of, e.g., hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and in the absence of ISO 10646 am unable to do so.

Overall, my perception is that we have the political statement - UTF-8 will be used - but have not yet worked out all the engineering ramifications.

Tom Petch
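
P.S. To make the length problem in C) concrete, here is a minimal Python sketch, not taken from any particular protocol. It assumes a toy TLV layout (one tag octet, one length octet, then the value) and a hypothetical lenient decoder, decode_lenient, that accepts over-long UTF-8 forms; both are invented for illustration, and a real gateway might well reject over-long forms outright instead. The point is only that rewriting the over-long two-octet form C0 AF of "/" to the shortest form 2F forces the length octet to change as well.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, also accepting over-long (non-shortest) forms.

    Hypothetical helper for illustration only; Python's own decoder
    rejects over-long sequences, which is the safer default.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # The lead octet tells us how many continuation octets follow.
        if b < 0x80:
            cp, extra = b, 0
        elif (b & 0xE0) == 0xC0:
            cp, extra = b & 0x1F, 1
        elif (b & 0xF0) == 0xE0:
            cp, extra = b & 0x0F, 2
        elif (b & 0xF8) == 0xF0:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"bad lead octet at offset {i}")
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"bad continuation at offset {i}")
        # Each continuation octet contributes its low six bits.
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)


def tlv(tag: int, value: bytes) -> bytes:
    """Build a toy TLV: one tag octet, one length octet, then the value."""
    if len(value) > 255:
        raise ValueError("value too long for a one-octet length")
    return bytes([tag, len(value)]) + value


def normalise_tlv(record: bytes) -> bytes:
    """Re-encode the value in shortest-form UTF-8 and rebuild the TLV."""
    tag, length = record[0], record[1]
    value = record[2 : 2 + length]
    shortest = decode_lenient(value).encode("utf-8")
    return tlv(tag, shortest)


# "a/b" with "/" written as the over-long pair C0 AF (tag 0x0C is arbitrary).
overlong = tlv(0x0C, b"a\xc0\xafb")
fixed = normalise_tlv(overlong)
print(overlong.hex(), "->", fixed.hex())
```

Running it prints 0c0461c0af62 -> 0c03612f62: the value shrinks by one octet, the length octet goes from 04 to 03, and anything that relied on the original length no longer matches.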