On Mon, Oct 23, 2006 at 05:50:03PM +0200, Edwin Steiner wrote:
>
> I checked what the reference implementation does, using the attached
> program: The RI always interprets the filenames it gets from the system
> as latin1 (or similar), independent of the file.encoding property, it
> seems. This has the following consequences:

I conducted further tests (the attached shell scripts help to create a
file with a latin1-encoded and a utf-8-encoded name, respectively).

What the RI does depends on the setting of the LANG variable.

For LANG=C

  * in the latin1 filename, e4 gets replaced by the replacement
    character U+fffd, encoded as ef bf bd in UTF-8 output (see
    http://www.fileformat.info/info/unicode/char/fffd/index.htm).

  * in the utf-8 filename, c3 a4 gets replaced by _two_ replacement
    characters: U+fffd U+fffd.

For LANG=en_US.UTF-8

  * the latin1 character e4 gets replaced by the replacement
    character (e4 becomes U+fffd).

  * the utf-8 filename is read correctly (c3 a4 becomes U+00e4).

For LANG=en_US.iso88591

  * the latin1 filename is read correctly (e4 becomes U+00e4).

  * the utf-8 filename is read as latin1 (c3 a4 becomes the _two_
    characters U+00c3 and U+00a4).

Aren't encodings fun? ;)

-Edwin
Attachment: t-latin1.sh
Description: Bourne shell script

Attachment: t-utf8.sh
Description: Bourne shell script
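
The six outcomes listed in the message can be reproduced at the byte level with a small sketch. This is not the attached test program and it does not run the RI; it only assumes that each LANG setting selects one charset (LANG=C as US-ASCII is an assumption, the message does not say which charset the RI actually picks) and that undecodable bytes are replaced with U+fffd. The names LOCALE_CHARSETS, FILENAME_BYTES, decode_name, codepoints and report are hypothetical.

```python
# Hypothetical sketch: Python codecs stand in for the charset the RI
# appears to select for each LANG value. The LANG=C -> ASCII mapping is
# an assumption, not something the original message confirms.
LOCALE_CHARSETS = {
    "C": "ascii",
    "en_US.UTF-8": "utf-8",
    "en_US.iso88591": "latin-1",
}

# The raw filename bytes for the character U+00e4 in each encoding.
FILENAME_BYTES = {
    "latin1": b"\xe4",
    "utf-8": b"\xc3\xa4",
}


def decode_name(raw, lang):
    """Decode raw filename bytes, replacing undecodable bytes with U+fffd."""
    return raw.decode(LOCALE_CHARSETS[lang], errors="replace")


def codepoints(text):
    """Render a string as space-separated code points, e.g. 'U+00e4'."""
    return " ".join(f"U+{ord(ch):04x}" for ch in text)


def report():
    """Return (lang, filename encoding, decoded code points) for all cases."""
    rows = []
    for lang in LOCALE_CHARSETS:
        for label, raw in FILENAME_BYTES.items():
            rows.append((lang, label, codepoints(decode_name(raw, lang))))
    return rows


if __name__ == "__main__":
    for lang, label, cps in report():
        print(f"LANG={lang:<15} {label:<7} -> {cps}")
```

Calling report() walks through every LANG and filename combination; under the stated assumptions the decoded code points should line up with the list in the message.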