Re: Troubles with UTF-8

"JFC (Jefsey) Morfin" <jefsey@xxxxxxxxxx> · Fri, 23 Dec 2005 15:46:46 +0100

At 13:44 23/12/2005, Masataka Ohta wrote:
Tom.Petch wrote:
> Overall, my perception is that we have the political statement - UTF-8
> will be used - but have not yet worked out all the engineering
> ramifications.

Correct. Like so many results of IETF, enforcing Unicode just does
not work.

Amen. This is an architectural feature decided for political reasons 
which does not scale.

But, never mind. Unicode has nothing to do with the internationalization.

I beg to differ on wording. Internationalization is an IETF/Unicode 
word. It is part of the equation "globalization = global environment 
internationalization + local environment localization". As IBM 
understands it, it means reducing the language barrier between the 
core and the ends it relates to. I think it is appropriate to the 
IETF US-ASCII based Internet technology.

But the real world is "multinationalization" (to keep the same 
image; or multilingualization): the same thing, but for every 
end-to-end relation (and language). Let us consider the IETF RFC 2277 
proposition: content must be in Unicode (client system) and the 
protocol is in US-ASCII (core system). A document may appear to be in 
a given language, but when you read its source it is in English 
interspersed with Unicode-encoded text.

The internationalization (RFC 3066bis) culture is unilateral. 
Networking calls for a multilateral culture architecture (RFC 4151 may help).

The only solution I see, which addresses the requirements of Tom 
Petch, is to go through a common universalisation layer (not charset 
dependent), accepting the existing US-ASCII environment of Masataka 
Ohta as a maximum. It should then go down to Hexa, getting rid of the 
Unicode-based layer violations and permitting a full charset support 
strategy in which Unicode could fully play its role of common reference.

Obviously two-tier policies based on langtags could not develop as 
easily as planned.
jfc

> others to
> 0000-00FF, essentially Latin-1, which suits many Western languages but
> is not truly international.

The only appropriate subset of Unicode is 0000-007f, ASCII. Latin-1,
which introduced the confusions of the currency symbol and NBSP, is
already overkill.
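
As a rough illustration of the ranges under discussion (a sketch only,
not part of the original exchange; the helper name utf8_len is
hypothetical): code points 0000-007F encode in UTF-8 to the same single
octet as in ASCII, while the upper half of Latin-1, 0080-00FF, takes two
octets each in UTF-8.

```python
def utf8_len(code_point: int) -> int:
    """Return the number of octets UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# ASCII range 0000-007F: the UTF-8 output is byte-for-byte identical to ASCII.
ascii_text = "".join(chr(cp) for cp in range(0x00, 0x80))
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Octet counts per code point for the two halves of the Latin-1 range.
ascii_lens = [utf8_len(cp) for cp in range(0x00, 0x80)]
upper_latin1_lens = [utf8_len(cp) for cp in range(0x80, 0x100)]

print(set(ascii_lens), set(upper_latin1_lens))
```

So a receiver that only understands ASCII reads the 0000-007F part of a
UTF-8 stream unchanged, whereas any Latin-1 character above 007F arrives
as a two-octet sequence it cannot interpret.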

> Unicode lacks a no-op, a meaningless octet,

The confusion over NBSP implies that spaces are not very meaningful
octets, so that they may be replaced by line break characters.
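
To make the NBSP point concrete, here is a small sketch (an
illustration added for this archive copy, with a made-up sample string)
of how U+00A0 differs from the ASCII space U+0020 at the octet level,
and how one common runtime nevertheless treats it as ordinary
whitespace.

```python
SPACE = "\u0020"  # ASCII space
NBSP = "\u00a0"   # no-break space, introduced with Latin-1

# Hypothetical sample text: a number and a unit joined by NBSP.
text = "10" + NBSP + "km"

nbsp_utf8 = NBSP.encode("utf-8")       # two octets in UTF-8
nbsp_latin1 = NBSP.encode("latin-1")   # one octet in ISO 8859-1
nbsp_is_ascii = NBSP.isascii()         # NBSP lies outside 0000-007F

# Python's default whitespace split treats NBSP as a separator,
# while splitting on the ASCII space alone leaves the text intact.
tokens_any_whitespace = text.split()
tokens_ascii_space = text.split(SPACE)

print(nbsp_utf8, nbsp_latin1, tokens_any_whitespace, tokens_ascii_space)
```

Whether NBSP counts as "a space" thus depends on which layer is
looking at it, which is the kind of ambiguity described above.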

So, the situation is worse than you would have considered and even
full Latin-1 is hopeless.

Just interpret UTF-8 as ASCII.

                                                        Masataka Ohta

_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf
