> The IETF mandates the use of UTF-8 for text [RFC2277] as part of
> internationalisation. When writing an RFC, this raises a number of issues.
>
> A) Character set. UTF-8 implicitly specifies the use of Unicode/IS10646 which
> contains 97,000 - and rising - characters. Some (proposed) standards limit
> themselves to 0000..007F, which is not at all international, others to
> 0000-00FF, essentially Latin-1, which suits many Western languages but is not
> truly international. Is 97,000 really appropriate or should there be a defined
> subset?

Short answer: No.

Longer answer: Subsetting makes sense in some places, such as when defining
canonicalization schemes like our various *prep profiles. You pretty much have
to limit yourself to what's currently defined for these, and Unicode is, for
better or worse, open-ended.

However, limiting everything to a subset of Unicode simply because "97,000
characters are too many" is a terrible idea. This has been tried - take a look
at the various string types for Unicode in ASN.1 or the text body part types
in X.400 - and the result has been nothing but a mess.

> B) Code point. Many standards are defined in ABNF [RFC4234] which allows code
> points to be specified as, eg, %b00010011 %d13 or %x0D none of which are
> terribly Unicode-like (U+000D). The result is standards that use one notation
> in the ABNF and a different one in the body of the document; should ABNF allow
> something closer to Unicode (as XML has done with &#xD;)?

ABNF is charset-independent, mapping onto non-negative integers, not
characters. Nothing prevents a specification from saying that a given ABNF
grammar specifies a series of Unicode characters represented in UTF-8 and
using %xFEFF or whatever in the grammar itself.

> C) Length. Text is often variable in length so the length must be determined.
> This may be implicit from the underlying protocol or explicit as in a TLV. The
> latter is troublesome if the protocol passes through an application gateway
> which wants to normalise the encoding so as to improve security and wants to
> convert UTF to its shortest form with corresponding length changes

The various length issues and tradeoffs that exist in the different Character
Encoding Schemes for Unicode are well known, have been extensively debated,
and are well understood. There are inherent tensions between the various
formats (preserve ASCII vs. equal length for all characters vs. wasting
space), and this makes any choice a compromise.

> (Unicode lacks a no-op, a meaningless octet, one that could be added or
> removed without causing any change to the meaning of the text).

NBSP is used for this purpose.

> Other protocols use a terminating sequence. NUL is widely used in *ix; some
> protocols specify that NUL must terminate the text, some specify that it must
> not, one at least specifies that embedded NUL means that text after a NUL must
> not be displayed (interesting for security). Since UTF-8 encompasses so much,
> there is no natural terminating sequence.

This simply isn't true. NUL is present in Unicode and is commonly used as a
terminator.

> D) Transparency. An issue linked to C), protocols may have reserved characters,
> used to parse the data, which must not then appear in text. Some protocols
> prohibit these characters (or at least the single octet encoding of them),
> others have a transfer syntax, such as base64, quoted-printable, %xx or an
> escape character ( " \ %). We could do with a standard syntax.

I disagree. Different environments have very different constraints, and in
many cases these constraints interact with the underlying character data in
ways that force the use of different escaping conventions. For example, the
differences between the quoted-printable content-transfer-encoding and the Q
encoding of encoded-words are NOT gratuitous.

> E) Accessibility. The character encoding is specified in UTF-8 [RFC3629] which
> is readily accessible (of course:-) but to use it properly needs reference to
> IS10646, which is not. I would like to check the correct name of eg
> hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and in the absence of IS10646
> am unable to do so.

The entire Unicode character database is readily available online:

  http://www.unicode.org/ucd/

A quick check shows that 0x002D is written as HYPHEN-MINUS in the database and
in the code charts. I didn't need (and never have needed) a copy of ISO 10646
to find out stuff like this. I do find a copy of the printed Unicode book
useful, though not required (I have versions 1 through 3); it is readily
available, albeit not free.

The reason many of our standards documents refer to ISO 10646 is that at one
time there was concern that Unicode wasn't sufficiently stable, and it was
felt that reference to the ISO document would offer some protection against
capricious change. I think in retrospect this concern has been shown to be
unwarranted, and all things being equal I would prefer to see references to
the more readily available Unicode materials. (Given the wide deployment of
Unicode now there is effectively no chance of a major change along the lines
of the Hangul reshuffle between V1 and V2.)
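For what it's worth, both the length shifts mentioned under C) and the name
lookup under E) can be seen in a few lines of Python. A minimal sketch,
assuming a reasonably current Python 3 whose bundled unicodedata module
tracks the Unicode character database:

    import unicodedata

    # E) The formal name of U+002D, straight from the character database
    #    that ships with the interpreter - no copy of ISO 10646 needed.
    print(unicodedata.name("\u002D"))          # HYPHEN-MINUS
    print(unicodedata.lookup("HYPHEN-MINUS"))  # -

    # C) Normalization can change both the code point count and the UTF-8
    #    octet count, which is why a gateway that normalizes text carried
    #    in a TLV has to be prepared to rewrite the length field.
    decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    # decomposed: 2 code points, 3 UTF-8 octets; composed: 1 code point, 2 octets
    print(len(decomposed), len(decomposed.encode("utf-8")))
    print(len(composed), len(composed.encode("utf-8")))

    # C) Shortest-form enforcement: a conforming UTF-8 decoder rejects the
    #    overlong two-octet encoding of NUL instead of mapping it to U+0000.
    try:
        b"\xC0\x80".decode("utf-8")
    except UnicodeDecodeError as err:
        print("overlong encoding rejected:", err)

None of this needs anything beyond the data that ships with the interpreter.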
> Overall, my perception is that we have the political statement - UTF-8 will be
> used - but have not yet worked out all the engineering ramifications.

Well, we have a lot more than a political statement - a huge amount of
engineering work has been done to make Unicode workable in IETF protocols and
elsewhere. Does more work need to be done? Of course it does - these tasks are
by their very nature pretty much unending. But all of the points you have
raised here are either nonissues, settled issues, or engineering compromises
where there is no "right" answer.

				Ned