Re: OS X and umlauts in file names

On Wed, Nov 25, 2009 at 09:50, Thomas Singer <thomas.singer@xxxxxxxxxxx> wrote:
> I did the following:
>
>  toms-mac-mini:git-umlauts tom$ ls
>  Überlänge.txt
>  toms-mac-mini:git-umlauts tom$ git status
>  # On branch master
>  #
>  # Initial commit
>  #
>  # Changes to be committed:
>  #   (use "git rm --cached <file>..." to unstage)
>  #
>  #     new file:   "U\314\210berla\314\210nge.txt"
>  #
>  toms-mac-mini:git-umlauts tom$ git stage "U\314\210berla\314\210nge.txt"
>  fatal: pathspec 'U\314\210berla\314\210nge.txt' did not match any files
>
> Note that I copy-pasted the file name that 'git status' showed into the
> stage command. IMHO, this should work, especially because several people
> said Git treats the file name as a byte array without interpreting it in
> any way.
>
> I got the output of 'env' from the user with the German OS X (for whom
> staging is said to work), and so I also tried
>
>  export LANG=de_DE.UTF-8
>
> before doing the above steps, but with the same results. :(

The problem you are having is not caused by the *encoding*; it's the
normalization form that's messing things up. In Unicode there are two
ways to represent many -- but not all -- accented characters:

- "composed": one code point for the accented character
- "decomposed": two or more code points: one for the base letter, plus
  one or more combining characters for the accents
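
To make the two forms concrete, here is a minimal sketch in Python (my
choice for illustration; nothing in this thread depends on it). It uses
the standard unicodedata module to build both forms of the file name and
compares the decomposed one with the bytes quoted by 'git status' above.
The variable names are just placeholders.

```python
import unicodedata

# Written with escapes so the example does not depend on how this
# text itself happens to be normalized.
name = "\u00dcberl\u00e4nge.txt"

# NFC is the composed form, NFD the decomposed form.
composed = unicodedata.normalize("NFC", name)
decomposed = unicodedata.normalize("NFD", name)

# Same glyphs on screen, different code point sequences.
print(len(composed), len(decomposed))   # 13 15
print(composed == decomposed)           # False

# The UTF-8 bytes differ as well.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# U+0308 (COMBINING DIAERESIS) is 0xCC 0x88 in UTF-8, which is
# \314\210 in octal, the escapes shown in the 'git status' output.
quoted = b"U\314\210berla\314\210nge.txt"
print(decomposed.encode("utf-8") == quoted)   # True
```

So the name in the quoted 'git status' output is in decomposed form.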

The composed code points really exist only for backward compatibility
with legacy encodings (like Latin-1). If you want to actually support
(rather than just tolerate) Unicode, you have to know how to deal with
the decomposed form, and once you can do that, there's little point
beyond backward compatibility in continuing to use the composed form
internally.

The Subversion people have run into this same problem because they
made the same error of assuming that any given sequence of glyphs has
only one possible representation as Unicode code points and thus only
one representation as UTF-8 bytes. Dionisos has written up the
issues involved here:

http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
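
As a rough illustration of why that assumption breaks, and of one way
around it, the sketch below compares two names only after normalizing
both to the same form. The helper same_filename is hypothetical; I am
not claiming this is what Git or Subversion does internally.

```python
import unicodedata

def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after normalizing both to the same form.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The same visible name, once composed and once decomposed.
composed = "\u00dcberl\u00e4nge.txt"
decomposed = "U\u0308berla\u0308nge.txt"

# A plain comparison sees two different names.
print(composed == decomposed)                # False

# Comparing after normalization treats them as the same name.
print(same_filename(composed, decomposed))   # True
```

Which form you normalize to does not matter for the comparison, as long
as both sides use the same one.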

// Ben