Re: Rationale behind UTF-8?

Havoc Pennington <hp@xxxxxxxxxx> · 10 Oct 2002 14:46:25 -0400

Robert Claeson <r.claeson@computer.org> writes:
> I noticed that Unicode UTF-8 is now the default encoding when most
> Western European locales are selected. Since some ISO 8859 character
> set is usually the norm for those locales, I would be interested in the
> rationale behind Psyche using UTF-8 rather than ISO 8859.
> 

The reasons for Unicode include:

 - so you can use multiple languages at once in a document

 - so that programs can write a single generic algorithm
   for, say, word breaking, instead of special-casing each
   locale

 - because most modern apps (all Qt and GTK apps, most scripting
   languages, etc.) use Unicode internally, so using it externally
   too avoids conversions at the boundary and speeds things up

 - so that Chinese/Japanese/Korean go through the same code paths
   as European languages, so that there are fewer CJK-specific
   issues. (Of course we don't default to UTF-8 for CJK yet, but
   it's coming.)

 - because the filesystem needs to be in UTF-8 unless all users
   of a system are using the same language exclusively
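
To make the first and last points concrete, here is a minimal Python
sketch (an illustration, not anything that ships with the distribution;
the filename and the helper name `encodable` are made up). It checks
whether one name that mixes French, Greek, and Japanese can be
represented in a couple of legacy single-language encodings versus UTF-8:

```python
# Hypothetical filename mixing three scripts: Latin with accents, Greek, Japanese.
name = "résumé-αβγ-日本語.txt"


def encodable(text, encoding):
    """Return True if every character of text can be encoded in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ISO 8859-1 covers Western European, ISO 8859-7 covers Greek.
results = {
    enc: encodable(name, enc)
    for enc in ("iso-8859-1", "iso-8859-7", "utf-8")
}

for enc, ok in results.items():
    print(f"{enc:12} {'ok' if ok else 'cannot represent this name'}")

# UTF-8 round-trips the name without loss.
assert name.encode("utf-8").decode("utf-8") == name
```

Each ISO 8859 part handles only its own slice of the name, so two users
with different legacy locales could not even agree on how to spell this
filename on disk; with UTF-8 they can.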

FWIW, the issues people are seeing with UTF-8 are almost all things
that Asian users have been living with for years... now everyone's in
the same boat, let's patch the leaks. ;-)

Havoc