Re: Troubles with UTF-8

Masataka Ohta <mohta@xxxxxxxxxxxxxxxxxxxxxxxxxx> · Wed, 28 Dec 2005 22:05:57 +0900

Tom.Petch wrote:

> The Unicode data I am thinking of may have come from an upper layer protocol and
> needs to be passed transparently (as with an error or hello message, identity
> even); it may or may not already be NUL-terminated (ever had that security
> foul-up where some userid/password are entered/stored NUL-terminated and some
> are not?) - hence I see the need to terminate the string in some other way, or
> to escape or in some other way transfer encode (parts of) the string.  I looked
> at existing RFCs, found many different approaches, all viable but none that
> really said to me 'this is good engineering, this is best practice'.  Hence,
> floating the issue to see if there were any better ones out there. I think not,
> which is of itself worth knowing.
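
One of the options mentioned above, terminating the string by some means
other than NUL, could be sketched as a length prefix. This is only an
illustration under stated assumptions: the 4-byte big-endian prefix and the
names frame() and unframe() are hypothetical and are not taken from any RFC
discussed in this thread.

```python
import struct
from typing import Tuple


def frame(text: str) -> bytes:
    """Prefix the UTF-8 bytes with a 4-byte big-endian length (hypothetical format)."""
    payload = text.encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def unframe(buf: bytes) -> Tuple[str, bytes]:
    """Return the decoded string and the bytes that follow the frame."""
    if len(buf) < 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack("!I", buf[:4])
    end = 4 + length
    if len(buf) < end:
        raise ValueError("truncated frame")
    return buf[4:end].decode("utf-8"), buf[end:]


# An embedded NUL passes through untouched, because the length prefix,
# not a terminator byte, marks where the string ends.
text, rest = unframe(frame("user\x00id") + b"next")
```

With this kind of framing the payload is carried transparently, so whether
the upper layer already NUL-terminated it no longer matters to the receiver.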

There is nothing you can do.

The problem is that Unicode is stateful, with complex states that
can persist indefinitely, which is a lot worse than a properly
profiled ISO 2022 encoding such as that of RFC 1468 (ISO-2022-JP),
which is the character encoding most widely used for Japanese.

Unicode is not even finite state, which means some pattern
matching and normalization problems are hard or unsolvable.
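
One way to see the kind of unbounded context involved is canonical
reordering of combining marks during normalization. The sketch below uses
Python's standard unicodedata module; the strings and variable names are
chosen only for illustration, and it is an example of the issue rather
than a proof of the claim above.

```python
import unicodedata

# Two different code point sequences for the same text.
composed = "\u00e9"     # e with acute, precomposed
decomposed = "e\u0301"  # e followed by combining acute
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Canonical ordering sorts combining marks by combining class:
# dot below (class 220) moves ahead of dot above (class 230).
nfd_short = unicodedata.normalize("NFD", "q\u0307\u0323")

# The run of combining marks has no fixed upper bound, so a normalizer
# cannot emit anything after the base character until the run ends.
long_run = "a" + "\u0307" * 1000 + "\u0323"
nfd_long = unicodedata.normalize("NFD", long_run)
moved_to_front = nfd_long[1] == "\u0323"

print(same_after_nfc, moved_to_front)
```

In the last case the final code point of a 1002-character sequence ends up
in second position, so the amount of text that has to be buffered depends
on the input, not on a fixed window.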

OTOH, if you start from scratch, you can have an encoding whose
states are finite and much shorter-lived.

						Masataka Ohta

_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf