Re: problem with latin1 filename and VMFile.list

Edwin Steiner <edwin.steiner@xxxxxxx> · Tue, 24 Oct 2006 01:14:25 +0200

On Mon, Oct 23, 2006 at 05:50:03PM +0200, Edwin Steiner wrote:
> 
> I checked what the reference implementation does, using the attached
> program: The RI always interprets the filenames it gets from the system
> as latin1 (or similar), independent of the file.encoding property, it
> seems. This has the following consequences:

I conducted further tests (the attached shell scripts help to create
files with a latin1-encoded and a utf-8-encoded name, respectively).
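
For readers without the attachments, here is a rough sketch of how such
test files could be created. It is not the content of the attached
scripts; the function name create_test_files and the file names are
made up for illustration. It assumes a POSIX filesystem that accepts
arbitrary bytes in file names (some filesystems reject names that are
not valid UTF-8).

```python
import os
import tempfile

# 'ä' (U+00e4) as raw bytes in the two encodings under test.
LATIN1_BYTES = b"\xe4"
UTF8_BYTES = b"\xc3\xa4"


def create_test_files(directory: bytes) -> list:
    """Create one empty file per encoding and return the raw names.

    The names are passed to the OS as bytes, so Python's own filename
    encoding never touches them. The name layout is hypothetical.
    """
    names = [
        b"test-" + LATIN1_BYTES + b"-latin1",
        b"test-" + UTF8_BYTES + b"-utf8",
    ]
    for name in names:
        with open(os.path.join(directory, name), "wb"):
            pass
    return names


if __name__ == "__main__":
    target = os.fsencode(tempfile.mkdtemp())
    for raw in create_test_files(target):
        print(raw)
```

Listing the directory with a bytes path (os.listdir on the bytes value)
returns the names undecoded, which makes it easy to confirm what is
really on disk before running the Java test.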

What the RI does depends on the setting of the LANG variable.

For LANG=C

    * in the latin1 filename e4 gets replaced by the replacement
      character U+fffd, encoded as ef bf bd in UTF-8 output
      (see http://www.fileformat.info/info/unicode/char/fffd/index.htm).

    * in the utf-8 filename c3 a4 gets replaced by _two_
      replacement characters: U+fffd U+fffd.

For LANG=en_US.UTF-8

    * the latin1 character e4 gets replaced by the replacement character
      (e4 becomes U+fffd).

    * the utf-8 filename is read correctly (c3 a4 becomes U+00e4).

For LANG=en_US.iso88591

    * the latin1 filename is read correctly (e4 becomes U+00e4).

    * the utf-8 filename is read as latin1 (c3 a4 becomes the _two_
      characters U+00c3 and U+00a4).
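
All six results above are consistent with the filename bytes simply
being decoded in the charset implied by LANG, with undecodable input
replaced by U+fffd. The sketch below models that reading. It is only a
model and not the RI's code; in particular, the mapping from LANG
values to charsets (C to ASCII, and so on) and the names CHARSETS and
decode_name are my assumptions.

```python
# Raw filename bytes for 'ä' (U+00e4) in the two encodings tested.
LATIN1_NAME = b"\xe4"
UTF8_NAME = b"\xc3\xa4"

# Assumed mapping from LANG to the charset used for decoding.
CHARSETS = {
    "C": "ascii",
    "en_US.UTF-8": "utf-8",
    "en_US.iso88591": "latin-1",
}


def decode_name(raw: bytes, lang: str) -> str:
    """Decode filename bytes, substituting U+fffd for invalid input."""
    return raw.decode(CHARSETS[lang], errors="replace")


def codepoints(text: str) -> str:
    """Format a string as a list of U+xxxx code points."""
    return " ".join("U+%04x" % ord(ch) for ch in text)


if __name__ == "__main__":
    for lang in CHARSETS:
        for label, raw in (("latin1", LATIN1_NAME), ("utf-8", UTF8_NAME)):
            print(lang, label, codepoints(decode_name(raw, lang)))
```

Encoding the replacement character back to UTF-8 gives the bytes
ef bf bd mentioned for LANG=C.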

Aren't encodings fun? ;)

-Edwin

Attachment: t-latin1.sh
Description: Bourne shell script

Attachment: t-utf8.sh
Description: Bourne shell script