Re: Troubles with UTF-8

On 12/23/05, Tom.Petch <sisyphus@xxxxxxxxxxxxxx> wrote:
A) Character set.  UTF-8 implicitly specifies the use of Unicode/IS10646 which
contains 97,000 - and rising - characters.  Some (proposed) standards limit
themselves to 0000..007F, which is not at all international, others to
0000-00FF, essentially Latin-1, which suits many Western languages but is not
truly international.  Is 97,000 really appropriate or should there be a defined
subset?

Why should there be a subset? You really, really don't want to go into a debate about which script is more important than another.

B) Code point. Many standards are defined in ABNF [RFC4234] which allows code
points to be specified as, eg,  %b00010011 %d13 or %x0D none of which are
terribly Unicode-like (U+000D).  The result is standards that use one notation
in the ABNF and a different one in the body of the document; should ABNF allow
something closer to Unicode (as XML has done with &#x000D;)?

Following RFC4234, Unicode code point U+ABCD will just be represented as %xABCD.

I do not see the problem you mention. Or am I missing something?
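For illustration, in Python (U+ABCD here is just an arbitrary code point):

    ch = chr(0xABCD)                 # the code point U+ABCD
    print(hex(ord(ch)))              # 0xabcd -- the code point itself
    print(ch.encode("utf-8").hex())  # eaaf8d -- its three-octet UTF-8 encoding

%xABCD in the ABNF and U+ABCD in the body name the same thing; how it is encoded on the wire is a separate question.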
 

C) Length. Text is often variable in length so the length must be determined.
This may be implicit from the underlying protocol or explicit as in a TLV.  The
latter is troublesome if the protocol passes through an application gateway
which wants to normalise the encoding so as to improve security and wants to
convert UTF to its shortest form with corresponding length changes (Unicode
lacks a no-op, a meaningless octet, one that could be added or removed without
causing any change to the meaning of the text).

While simple byte counting obviously won't give you the accurate length of the text (since one Unicode character may be represented by one or more bytes), it is fairly trivial to write a script that counts the length of the text accurately, as sketched below. Heck, Perl 5.6 onwards even supports Unicode natively.
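For instance, in Python (just a sketch, with a made-up sample string):

    s = "caf\u00e9"                   # "café", four characters
    print(len(s))                     # 4 characters
    print(len(s.encode("utf-8")))     # 5 octets: the é takes two bytes in UTF-8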
 
Other protocols use a terminating sequence.  NUL is widely used in *ix; some
protocols specify that NUL must terminate the text, some specify that it must
not, one at least specifies that embedded NUL means that text after a NUL must
not be displayed (interesting for security).  Since UTF-8 encompasses so much,
there is no natural terminating sequence.

NUL is defined in Unicode, by the way, but I digress. You already started off on the wrong foot if you think of UTF-8 as some sort of programming encoding scheme rather than what it is: an encoding scheme for a character repertoire.
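To illustrate in Python (just a sketch): NUL is U+0000, a perfectly ordinary code point, and it encodes as a single 0x00 octet in UTF-8:

    s = "abc\x00def"
    print(len(s))                   # 7 characters, the NUL included
    print(s.encode("utf-8").hex())  # 61626300646566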
 
D) Transparency.  An issue linked to C), protocols may have reserved characters,
used to parse the data, which must not then appear in text.  Some protocols
prohibit these characters (or at least the single octet encoding of them),
others have a transfer syntax, such as base64, quoted-printable, %xx or an
escape character ( " \ %).  We could do with a standard syntax.

In those cases, Unicode U+ABCD or ABNF %xABCD does nicely. Why do we need another one?
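For what it is worth, the %xx transfer syntax mentioned above already handles UTF-8, since it operates on the octets; a quick Python illustration:

    from urllib.parse import quote, unquote
    s = "a/b \u00e9"                   # "a/b é"; '/' and space treated as reserved here
    print(quote(s, safe=""))           # a%2Fb%20%C3%A9
    print(unquote("a%2Fb%20%C3%A9"))   # a/b é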

E) Accessibility.  The character encoding is specified in UTF-8 [RFC3629] which
is readily accessible (of course:-) but to use it properly needs reference to
IS10646, which is not.  I would like to check the correct name of eg
hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and in the absence of IS10646 am
unable to do so.

In the absence of a dictionary, I couldn't understand most of the words you used in an RFC. OMG, what should I do?

http://www.unicode.org/charts/
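If you have Python handy, the unicodedata module will also give you the formal character name without needing IS10646 itself:

    import unicodedata
    print(unicodedata.name("-"))        # HYPHEN-MINUS
    print(unicodedata.name("\u00e9"))   # LATIN SMALL LETTER E WITH ACUTE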

-James Seng
