Troubles with UTF-8

The IETF mandates the use of UTF-8 for text [RFC2277] as part of
internationalisation.  When writing an RFC, this raises a number of issues.

A) Character set.  UTF-8 implicitly specifies the use of Unicode/IS10646, which
contains 97,000 - and rising - characters.  Some (proposed) standards limit
themselves to 0000-007F, which is not at all international, others to
0000-00FF, essentially Latin-1, which suits many Western languages but is not
truly international.  Is the full 97,000 really appropriate, or should there be
a defined subset?
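
To make the two restricted ranges concrete, here is a minimal Python sketch
that tests whether a piece of text stays within a given upper code point.  The
function name and the sample strings are purely illustrative; nothing here is
taken from any particular standard.

```python
def fits_repertoire(text: str, highest: int) -> bool:
    """Return True if every code point in text is <= highest."""
    return all(ord(ch) <= highest for ch in text)


ASCII_MAX = 0x007F   # the 0000-007F range
LATIN1_MAX = 0x00FF  # the 0000-00FF range

# Illustrative samples: plain ASCII, a Latin-1 word, a Greek word.
samples = [
    "hyphen-minus",
    "caf\u00e9",
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
]

# Maps each sample to (fits 0000-007F, fits 0000-00FF).
report = {
    s: (fits_repertoire(s, ASCII_MAX), fits_repertoire(s, LATIN1_MAX))
    for s in samples
}
```

Only the first sample fits 0000-007F, the second needs 0000-00FF, and the
third fits neither range.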

B) Code point.  Many standards are defined in ABNF [RFC4234], which allows code
points to be specified as, e.g., %b00001101, %d13 or %x0D, none of which is
terribly Unicode-like (U+000D).  The result is standards that use one notation
in the ABNF and a different one in the body of the document; should ABNF allow
something closer to Unicode (as XML has done with &#x000D;)?
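
The mismatch between the notations is purely one of spelling, as a small
Python sketch can show.  The helper below is hypothetical (it is not part of
any ABNF tool) and handles only a single numeric terminal, not ranges such as
%x30-39 or concatenations such as %d13.10.

```python
def abnf_to_unicode(terminal: str) -> str:
    """Rewrite one ABNF numeric terminal (e.g. %x0D) in U+ notation."""
    bases = {"b": 2, "d": 10, "x": 16}
    if (len(terminal) < 3 or terminal[0] != "%"
            or terminal[1].lower() not in bases):
        raise ValueError(f"not a single ABNF numeric terminal: {terminal!r}")
    # int() raises ValueError for ranges and concatenations, which this
    # sketch deliberately does not support.
    value = int(terminal[2:], bases[terminal[1].lower()])
    return f"U+{value:04X}"


# Three ABNF spellings of carriage return, all rewritten the same way.
spellings = [abnf_to_unicode(t) for t in ("%b00001101", "%d13", "%x0D")]
```

All three entries of `spellings` come out as U+000D.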

C) Length.  Text is often variable in length, so the length must be determined.
This may be implicit from the underlying protocol or explicit, as in a TLV.  The
latter is troublesome if the protocol passes through an application gateway
that, to improve security, normalises the encoding by converting UTF-8 to its
shortest form: the length of the text changes, so the length field must change
with it.  (Unicode lacks a no-op, a meaningless octet that could be added or
removed without causing any change to the meaning of the text.)
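
A rough Python sketch of the gateway problem, under loudly stated assumptions:
the TLV layout (one octet of type, one octet of length, then the value) is
invented for illustration, and only overlong two-octet forms (lead octet C0 or
C1) are handled.  Note that RFC 3629 treats such forms as invalid, so a strict
decoder would reject the input rather than repair it.

```python
def shorten_overlong(value: bytes) -> bytes:
    """Replace overlong two-octet sequences with their one-octet form."""
    out = bytearray()
    i = 0
    while i < len(value):
        lead = value[i]
        is_overlong = (
            lead in (0xC0, 0xC1)
            and i + 1 < len(value)
            and value[i + 1] & 0xC0 == 0x80  # next octet is a continuation
        )
        if is_overlong:
            out.append(((lead & 0x1F) << 6) | (value[i + 1] & 0x3F))
            i += 2
        else:
            out.append(lead)
            i += 1
    out.decode("utf-8")  # strict check: raises if the result is not UTF-8
    return bytes(out)


def normalise_tlv(tlv: bytes) -> bytes:
    """Normalise the value of a hypothetical 1-octet-type, 1-octet-length TLV."""
    tag, length = tlv[0], tlv[1]
    value = tlv[2:2 + length]
    if len(value) != length:
        raise ValueError("truncated TLV")
    shorter = shorten_overlong(value)
    # The length octet has to be rewritten to match the shortened value.
    return bytes([tag, len(shorter)]) + shorter


# "a", an overlong "/" (C0 AF), "b": four octets of value on the way in.
before = bytes([0x01, 0x04]) + b"a\xc0\xafb"
after = normalise_tlv(before)
```

Here `after` carries a length of 3 where `before` carried 4, so any outer
length, checksum or signature computed over the original octets no longer
matches.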

Other protocols use a terminating sequence.  NUL is widely used in *ix; some
protocols specify that NUL must terminate the text, some specify that it must
not, one at least specifies that embedded NUL means that text after a NUL must
not be displayed (interesting for security).  Since UTF-8 encompasses so much,
there is no natural terminating sequence.
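
The security angle of an embedded NUL can be seen in a few lines of Python.
This is only a sketch: the helper name is made up, and it imitates a receiver
that stops at the first NUL, as C string handling does.

```python
def up_to_nul(wire: bytes) -> str:
    """Return the text a NUL-terminated reader would see."""
    return wire.split(b"\x00", 1)[0].decode("utf-8")


text = "visible\u0000hidden"
wire = text.encode("utf-8")   # U+0000 is encoded as the single octet 00

seen_by_c_style_reader = up_to_nul(wire)
seen_by_length_reader = wire.decode("utf-8")
```

The two readers disagree about what the text is: one sees only "visible",
the other sees all 14 characters, NUL included.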

D) Transparency.  An issue linked to C): protocols may have reserved characters,
used to parse the data, which must not then appear in text.  Some protocols
prohibit these characters (or at least the single-octet encoding of them);
others have a transfer syntax, such as base64, quoted-printable, %xx or an
escape character ( " \ %).  We could do with a standard syntax.
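
For comparison, here is the same short text pushed through two of those
transfer syntaxes using the Python standard library.  The sample text, with
"/" standing in for a reserved character, is an assumption made for the
example.

```python
import base64
from urllib.parse import quote, unquote

text = "a/b \u00e9"   # contains a reserved "/", a space and a non-ASCII letter

# %xx: each octet of the UTF-8 encoding that is not unreserved becomes %XX.
percent = quote(text, safe="")

# base64: the UTF-8 octets are recoded into a 64-character alphabet.
b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Both syntaxes are reversible.
percent_back = unquote(percent)
b64_back = base64.b64decode(b64).decode("utf-8")
```

The %xx form is `a%2Fb%20%C3%A9`; the two forms differ in length and
readability, and a receiver has to know which one was applied.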

E) Accessibility.  The UTF-8 encoding is specified in [RFC3629], which
is readily accessible (of course:-) but to use it properly needs reference to
IS10646, which is not.  I would like to check the correct name of, e.g.,
hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and, in the absence of IS10646, am
unable to do so.
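
One freely available stand-in is the character database shipped with some
programming languages.  The Python sketch below looks a name up in the
standard `unicodedata` module; that its names agree with the text of IS10646
is an assumption here, not something checked against the standard itself.

```python
import unicodedata

# Name of U+002D as recorded in Python's bundled Unicode Character Database.
name = unicodedata.name("\u002d")

# The reverse lookup, from name to character.
char = unicodedata.lookup("HYPHEN-MINUS")
```

On this evidence the name is spelt HYPHEN-MINUS, in capitals.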

Overall, my perception is that we have the political statement - UTF-8 will be
used - but have not yet worked out all the engineering ramifications.

Tom Petch



_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf
