Re: Troubles with UTF-8

Frank Ellermann <nobody@xxxxxxxxxxxxxxxxx> · Fri, 23 Dec 2005 14:57:02 +0100

Tom.Petch wrote:

> should there be a defined subset?

There are already some defined subsets, e.g. SASLPREP, 
NAMEPREP, etc.

> Many standards are defined in ABNF [RFC4234] which allows
> code points to be specified as, eg,  %b00010011 %d13 or %x0D
> none of which are terribly Unicode-like (U+000D).

AFAIK you can use %x10FFFE, you're not limited to %x00-FF.
You can also use %x10.FF.FE if you're talking about octets.
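
To make the difference between one value and a dotted series of values concrete, here is a minimal Python sketch. The helper name parse_num_val is hypothetical, and it only covers single values and dot-concatenations from the RFC 4234 num-val rule; value ranges such as %x30-39 are deliberately left out.

```python
import re

# Base and permitted digits for each ABNF num-val prefix (b, d, x).
_BASES = {"b": 2, "d": 10, "x": 16}
_DIGITS = {"b": "01", "d": "0-9", "x": "0-9A-Fa-f"}


def parse_num_val(token):
    """Return the list of integers named by an ABNF num-val.

    Handles a single value (%x0D) and a dot-concatenation (%x10.FF.FE).
    Ranges (%x30-39) are not handled in this sketch.
    """
    match = re.fullmatch(r"%([bdxBDX])(.+)", token)
    if not match:
        raise ValueError(f"not a num-val: {token!r}")
    kind = match.group(1).lower()
    digits = re.compile(f"[{_DIGITS[kind]}]+")
    values = []
    for part in match.group(2).split("."):
        if not digits.fullmatch(part):
            raise ValueError(f"bad digits in {token!r}: {part!r}")
        values.append(int(part, _BASES[kind]))
    return values


print(parse_num_val("%x10FFFE"))    # one value: [1114110]
print(parse_num_val("%x10.FF.FE"))  # three values: [16, 255, 254]
```

The first form names a single number, the second a sequence of three numbers that each fit in an octet.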

> should ABNF allow something closer to Unicode (as XML has
> done with &#000D;)?

Dubious: %x10FFFF is valid ABNF, but U+10FFFF is a noncharacter.
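
A small Python sketch of that distinction, assuming the usual definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes). The function name is_noncharacter is made up for illustration.

```python
def is_noncharacter(cp):
    """True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, or any code point ending in FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


for cp in (0x000D, 0xFFFE, 0x10FFFE, 0x10FFFF):
    print(f"U+{cp:04X}", is_noncharacter(cp))
```

A numeric ABNF value is just a number, so the grammar accepts values that are code points but not characters.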

> Unicode lacks a no-op, a meaningless octet, one that could
> be added or removed without causing any change to the meaning
> of the text.

Can't you add U+FFFE to fully normalized text, or remove it,
and the result is still fully normalized?  I'm not sure, better
check this.
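
One way to check this is with Python's unicodedata module. The sketch below only tests NFC, and only on a couple of hand-picked strings, so it is a spot check and not a proof. The helper is_nfc is a hypothetical name.

```python
import unicodedata


def is_nfc(text):
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


NONCHAR = "\ufffe"

# Adding: insert U+FFFE at every position of an NFC string.
base = "caf\u00e9"
added = [base[:i] + NONCHAR + base[i:] for i in range(len(base) + 1)]
print(all(is_nfc(s) for s in added))

# Removing: U+FFFE sits between a base letter and a combining acute.
with_marker = "e" + NONCHAR + "\u0301"
without_marker = with_marker.replace(NONCHAR, "")
print(is_nfc(with_marker), is_nfc(without_marker))
```

In this example adding U+FFFE keeps the text in NFC, but removing it can bring a base letter and a combining mark together, and then the text is no longer in NFC.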

> Since UTF-8 encompasses so much, there is no natural
> terminating sequence.

If you want NUL as terminator you're free to use it, also in
UTF-8.  No multi-octet UTF-8 sequence contains the octet %x00
(there's no xx00 or xx00yy), if that's your problem.  If you
have the octet %x00 somewhere in a UTF-8 string it is always
U+0000 (also known as ASCII NUL).
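
This can be checked by brute force. The Python sketch below encodes every code point except the surrogates (which UTF-8 does not encode) and collects those whose encoding contains a zero octet. The function name is hypothetical.

```python
def code_points_with_nul_octet():
    """Return all code points whose UTF-8 encoding contains octet 0x00."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        if 0 in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits


print(code_points_with_nul_octet())  # [0]
```

Only U+0000 itself shows up, so a NUL octet can serve as a terminator as long as the protocol does not need U+0000 as data.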

> D)

Sorry, I don't get your point D.  Unicode has a block of 32
noncharacters (U+FDD0..U+FDEF) that could be good enough for
all internal and temporary purposes of an application.  It
also has the 65 control codes familiar from ISO 8859 (C0, DEL,
and C1); use them as you see fit in your protocol.
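
Both counts can be reproduced with Python's unicodedata module, assuming that "control code" means general category Cc and that the block meant is U+FDD0..U+FDEF. This is only a sketch to confirm the numbers.

```python
import unicodedata

# All code points with general category Cc (control).
controls = [cp for cp in range(0x110000)
            if unicodedata.category(chr(cp)) == "Cc"]

# The contiguous noncharacter block U+FDD0..U+FDEF.
nonchar_block = range(0xFDD0, 0xFDF0)

print(len(controls))       # 65
print(len(nonchar_block))  # 32
```

The 65 controls are U+0000..U+001F, U+007F, and U+0080..U+009F.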

> others have a transfer syntax, such as base64,
> quoted-printable, %xx or an escape character ( " \ %).
> We could do with a standard syntax.

Pick what you need as transfer encoding, B64 or QP; MIME would
even allow you to roll your own.
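
For comparison, here is how the same UTF-8 octets look under three of the existing encodings, using only the Python standard library. The sample string is an arbitrary choice for illustration.

```python
import base64
import quopri
from urllib.parse import quote, unquote_to_bytes

# "Gruesse" with u-umlaut and sharp s, as UTF-8 octets.
raw = "Gr\u00fc\u00dfe".encode("utf-8")

b64 = base64.b64encode(raw)   # base64
qp = quopri.encodestring(raw)  # quoted-printable
pct = quote(raw)               # %xx escaping as used in URIs

print(b64)
print(qp)
print(pct)

# All three are reversible.
print(base64.b64decode(b64) == raw)
print(quopri.decodestring(qp) == raw)
print(unquote_to_bytes(pct) == raw)
```

Which one fits depends on the protocol: base64 is compact for mostly non-ASCII data, while QP and %xx keep ASCII text readable.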

> I would like to check the correct name of eg hyphen-minus
> (Hyphen-minus, Hyphen-Minus, ???) and in the absence of
> IS10646 am unable to do so.

Maybe get the latest Unicode 4.1.0 data file (almost one MB):
<http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>

For SASLPREP / NAMEPREP you would take Unicode version 3.2,
and for a beta test of Unicode 5.0 take the beta data.
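
If a Python interpreter is at hand, its unicodedata module offers a quick lookup without downloading anything. Note the assumption: it reports whatever Unicode version that Python was built with, not necessarily 4.1.0. It also carries a frozen Unicode 3.2 database as unicodedata.ucd_3_2_0.

```python
import unicodedata

# Unicode version bundled with this Python build (varies by release).
print(unicodedata.unidata_version)

# Character name of U+002D.
print(unicodedata.name("\u002d"))  # HYPHEN-MINUS

# The same lookup against the frozen Unicode 3.2 database.
ucd32 = unicodedata.ucd_3_2_0
print(ucd32.unidata_version)       # 3.2.0
print(ucd32.name("\u002d"))
```

The name is spelled in capitals with a hyphen, HYPHEN-MINUS, in both databases.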

                       Bye, Frank

_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf