On 22.01.12 11:03, Nguyen Thai Ngoc Duy wrote:
> 2012/1/22 Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx>:
>>>> In order to prevent that ever a file name in decomposed unicode is
>>>> entering the index, a "brute force" attempt is taken: all arguments into
>>>> git (argv[1]..argv[n]) are converted into precomposed unicode.
>
> Forgot one more thing. We have case-insensitive support in place
> already, we can hook precomposed form conversion there before
> comparing. In other words we just need to support
> {pre,de}composed-insensitive string compare.
> --
> Duy

Yes, I like that idea.

After doing some experiments with precomposed and decomposed file names,
the motivation for the fix, or so to say the root cause, changed.

Mac OS X on a VFAT file system has a kind of schizophrenia:
- unicode file names are written to disk as precomposed, and
  Linux/Windows see them as precomposed.
- readdir() always returns decomposed.
- open()/fopen(), stat() and lstat() work for both pre- and decomposed.

Therefore a repository on VFAT under Mac OS X looks as follows:
- file names on disk are precomposed
- file names in the index are decomposed

As long as only Mac OS uses that device, there is no problem.
When we now move e.g. a USB stick using VFAT to Linux, the decomposed
names in the index appear to be deleted, and the precomposed names on
disk appear to be untracked. A complete disaster.

To keep the file names in the index the same as they are stored on
disk, we can set "core.precomposedunicode" = true.
(Side note: should we rename it to "i18n.precomposedunicode" ?)
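
To make the index/disk mismatch described above concrete, here is a
minimal Python sketch. It is not git code: the helpers precompose() and
status() and the file name are made up for illustration, and the index
and the directory listing are modelled as plain lists of names. It
assumes that "precomposed" corresponds to Unicode NFC normalization.

```python
import unicodedata


def precompose(name: str) -> str:
    """Return the precomposed (NFC) form of a file name."""
    return unicodedata.normalize("NFC", name)


def status(index_names, disk_names, normalize=False):
    """Compare index entries with a directory listing.

    Returns (deleted, untracked): names only in the index and names
    only on disk. With normalize=True both sides are converted to
    precomposed form before comparing.
    """
    if normalize:
        index_names = [precompose(n) for n in index_names]
        disk_names = [precompose(n) for n in disk_names]
    index_set, disk_set = set(index_names), set(disk_names)
    deleted = sorted(index_set - disk_set)
    untracked = sorted(disk_set - index_set)
    return deleted, untracked


# Hypothetical example: Mac OS X recorded the decomposed name in the
# index, while the VFAT disk holds the precomposed one.
decomposed = "u\u0302.txt"   # 'u' followed by COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00fb.txt"   # LATIN SMALL LETTER U WITH CIRCUMFLEX

index = [decomposed]
disk = [precomposed]         # what a Linux readdir() would report

print(status(index, disk))                  # byte-wise comparison
print(status(index, disk, normalize=True))  # composition-insensitive
```

Compared as-is, the two names differ, so the index entry looks deleted
and the file on disk looks untracked. Once both sides are converted to
precomposed form they are the same name and neither list has an entry.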

With that setting, we keep the index and the file names on disk the
same, and can move the USB stick between Linux, Mac OS X, or Windows
(since msysGit-1.7.10, now called Git for Windows, or git under
cygwin 1.7).

The same problem occurs when Mac OS mounts a network share from Linux
using SAMBA: readdir() returns decomposed, creat() stores file names
as precomposed on the remote Linux machine.

If we put a file name with decomposed unicode on the Linux machine,
it will be listed as decomposed by readdir() on the Mac OS side.
Trying to access this file fails, because Mac OS tries to open
the precomposed version, and that does not exist.

Another thing: there are some reasons to avoid decomposed unicode
on Linux and Windows. Many user space programs don't handle decomposed
unicode very well. When e.g. an "û" should be displayed, the output
looks like "u^" in many programs.

And if we need more motivation: decomposed unicode is hard to enter
on the keyboard.

Then I went back to my original problem (versioning the "Documents"
folder in my Linux $HOME under git and accessing it from Windows and
Mac OS using SAMBA, or cloning it to a laptop...)

Knowing that
a) Mac OS X handles precomposed and decomposed the same in open()...
b) The user space programs on Mac OS handle precomposed just fine
c) Many user space programs don't display decomposed as they should
d) It is hard to enter decomposed unicode at the keyboard
e) and therefore decomposed unicode is seldom used on Linux
f) Mac OS using SAMBA puts file names in precomposed unicode on the
   remote side

Do we have a motivation for pushing a solution that ignores the
unicode composition?

I'll send a V5 version with hopefully a better motivation.