Re: Can I display Chinese character filenames in an

James Richard Tyrer <tyrerj@xxxxxxx> · Wed, 06 Oct 2004 11:54:09 -0700

Robin Rosenberg wrote:
On Monday 04 October 2004 18.35, James Richard Tyrer wrote:

Robin Rosenberg wrote:

On Monday 04 October 2004 04.56, James Richard Tyrer wrote:

Obviously, what I said is not Chinese specific.  It applies to any and
all UTF-8 encoded file names.  ISO-8859-1 is a subset of UTF-8 so Latin
characters will display just the same.

No. ASCII is a subset of UTF-8.  ISO-8859-1 and UTF-8 are different and
incompatible (or I would be using UTF-8 today).

I have: "LANG=en_us.utf8" and I have no problems.  IIRC, that is what I
have read at authoritative sources.  But, do you mean that glyphs 128-255
are not the same in ISO-8859-1 and UTF-8?  Perhaps there are some problems
that I am not aware of since all I ever use (128-255) are Latin letters
with diacritical marks.  It does appear that odd combinations of characters
could be interpreted as something other than ISO-8859-1.

ISO-8859-1 is both an encoding and a character set, while UTF-8 is only an
encoding for the Unicode character set. The code points of the two character
sets coincide in the first 256 positions.  As encodings, however, only the
first 128 positions (0-127, i.e. ASCII) are identical. UTF-8 can encode all
characters in the ISO-8859-1 character set, but it does so differently, using
a variable-length encoding.
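
The distinction can be checked with a short Python sketch (Python is an assumption here, it was not used in the thread). It encodes each of the first 256 code points both ways and collects the ones whose bytes come out identical:

```python
# For every code point 0..255, compare its ISO-8859-1 bytes with its UTF-8 bytes.
same = [cp for cp in range(256)
        if chr(cp).encode("iso-8859-1") == chr(cp).encode("utf-8")]

# Code points whose byte representation is the same in both encodings.
print(len(same), same[0], same[-1])

# Example of a code point above 127: U+00E5 (a with ring above).
latin1_e5 = chr(0xE5).encode("iso-8859-1")   # one byte
utf8_e5 = chr(0xE5).encode("utf-8")          # two bytes
print(latin1_e5.hex(), utf8_e5.hex())
```

Only the ASCII range ends up in the list; everything from 128 to 255 has the same code point in both character sets but a different byte representation.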

The filename "åäö" can be stored as the byte sequence [e5 e4 f6] when my
locale is set to ISO-8859-1 or as [c3 a5 c3 a4 c3 b6] when using UTF-8. I can't
have it both ways. Under an ISO-8859-1 locale the UTF-8 encoding shows up as
"Ã¥Ã¤Ã¶" (unreadable garbage). In order to switch my locale from ISO-8859-1 to
UTF-8 I have to convert my filenames, as most non-ASCII filenames would be
illegal in UTF-8 (not that many programs care). The others (non-ASCII again)
will look wrong.
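
A minimal Python sketch of the same example follows. The helper name latin1_name_to_utf8 is hypothetical and only illustrates the conversion step; the sketch does not touch any real files.

```python
# The filename from the example, written with escapes so the source stays ASCII.
name = "\u00e5\u00e4\u00f6"

latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")
print(" ".join(f"{b:02x}" for b in latin1_bytes))
print(" ".join(f"{b:02x}" for b in utf8_bytes))

# Reading the UTF-8 bytes as if they were ISO-8859-1 gives the garbled form.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Reading the ISO-8859-1 bytes as UTF-8 fails: they are not a valid sequence.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)


def latin1_name_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: re-encode an ISO-8859-1 filename as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


converted = latin1_name_to_utf8(latin1_bytes)
```

Renaming actual files would also need something like os.rename on byte-string paths, which is left out here.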

Do "ls filenamewithdiacriticalmarks|od -tx1" and you'll see a variable-length
encoding with one or two bytes depending on the character (Chinese characters
are even longer). UTF-8 could require up to six bytes for one single character
in its original design (RFC 3629 has since limited it to four).
I'm not sure if the Unicode Consortium has defined any such character yet.
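
The variable length is easy to see in a small Python sketch. The sample characters are chosen only for illustration; the last two are the first code point outside the Basic Multilingual Plane and the highest code point Unicode allows.

```python
# Sample code points: ASCII, Latin-1 range, a CJK ideograph, U+10000, U+10FFFF.
samples = [0x61, 0xE5, 0x4E2D, 0x10000, 0x10FFFF]

# Number of bytes UTF-8 needs for each sample.
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

for cp, n in lengths.items():
    print(f"U+{cp:04X}: {n} byte(s)")
```

With the code space capped at U+10FFFF, no character that can be encoded this way needs more than four bytes.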

I do note two things:

The first 256 code points of Unicode *are* the same as ISO-8859-1.

It appears that KDE's clipboard converts to UTF-8 automatically.

--
JRT
___________________________________________________
Account management:  https://mail.kde.org/mailman/listinfo/kde.
Archives: http://lists.kde.org/.
More info: http://www.kde.org/faq.html.