Re: Troubles with UTF-8

Masataka Ohta <mohta@xxxxxxxxxxxxxxxxxxxxxxxxxx> · Wed, 04 Jan 2006 16:12:18 +0900

Tim Bray wrote:

>> That problem is that Unicode is stateful with complex and
>> indefinitely long term states

> Has this ever caused a real problem to a real programmer in real life?

Yes, of course. State information preserved between lines is
really annoying.

But you miss the point of my original mail:

: Unicode is not even finite state, which means some pattern
: matching and normalization problems are hard or insolvable.

That is, with Unicode, you cannot search strings in a reasonable
amount of time.
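
As a minimal sketch of why matching is not a plain comparison of
code points, assuming Python's standard unicodedata module (the
function names and sample strings are hypothetical, chosen only for
illustration): two canonically equivalent spellings of the same word
do not match until both sides are normalized.

```python
import unicodedata

def naive_find(haystack: str, needle: str) -> int:
    """Code-point-wise search with no normalization."""
    return haystack.find(needle)

def normalized_find(haystack: str, needle: str, form: str = "NFC") -> int:
    """Normalize both strings to the same form before searching.

    The returned index refers to the normalized haystack, not the original.
    """
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)

composed = "caf\u00e9"      # e-acute as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(naive_find(decomposed, composed))       # no match on raw code points
print(normalized_find(decomposed, composed))  # match after normalization
```

This shows only that a search has to account for equivalent
sequences. It does not by itself demonstrate anything about the
cost of doing so.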

> I have written a whole bunch of mission-critical code that reads and
> generates UTF-8, and any correct implementation will have to deal with
> the fact that there is no necessary connection between the number of
> glyphs on the screen and bytes in its encoding.

You completely miss the point. It has nothing to do with the long
term state.
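
For reference, the mismatch described in the quoted text, between
encoded bytes and what appears on the screen, can be seen in a few
lines of Python. This is a sketch with a hypothetical helper name;
it concerns a single displayed character and nothing longer.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (number of UTF-8 bytes, number of code points)."""
    return len(text.encode("utf-8")), len(text)

# Both strings are normally displayed as a single e-acute glyph.
print(sizes("\u00e9"))    # precomposed form
print(sizes("e\u0301"))   # base letter plus combining accent
```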

> It would be perfectly
> reasonable for an implementation to declare a limitation, for example
> that it will not process more than 32 trailing modifiers on any character,
> and this would not cause problems in production because sequences of
> such a length do not occur in the encoding of any known text.

I said "long term state", which, of course, is not confined to a
character with or without modifiers.
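
The per-character limitation proposed in the quoted text could be
sketched as below. Two assumptions are mine: that "modifier" means a
code point whose general category is a mark (Mn, Mc or Me), and the
helper name. The limit of 32 is the figure from the quoted text. Note
that the counter resets at every base character, so a check of this
kind says nothing about state that outlives a character.

```python
import unicodedata

MAX_TRAILING_MARKS = 32  # example figure taken from the quoted text

def exceeds_mark_limit(text: str, limit: int = MAX_TRAILING_MARKS) -> bool:
    """Return True if any run of consecutive marks is longer than `limit`.

    Assumption: a "modifier" is any code point with general category M*.
    """
    run = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > limit:
                return True
        else:
            run = 0  # a non-mark character starts a new run
    return False

print(exceeds_mark_limit("a" + "\u0301" * 32))  # exactly at the limit
print(exceeds_mark_limit("a" + "\u0301" * 33))  # one mark over the limit
```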

> Which is to say, Ohta's statement about statefulness is true, but the
> conclusion that this is a "problem" is erroneous. -Tim

Instead, your statement: "I have written a whole bunch of mission-
critical code that reads and generates UTF-8" is untrustworthy.

Of course, it is perfectly reasonable for an implementation to
declare a limitation, for example, that it will not process
non-ASCII characters, which may also be the assumption of your
code.

						Masataka Ohta 

_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf