On Mon, 6 Jul 2009, Junio C Hamano wrote:
>
> * lt/read-directory (Fri May 15 12:01:29 2009 -0700) 3 commits
>  - Add initial support for pathname conversion to UTF-8
>  - read_directory(): infrastructure for pathname character set
>    conversion
>  - Add 'fill_directory()' helper function for directory traversal
>
> Before adding the real "conversion", this needs a few real fixups, I
> think. For example there is one hardcoded array that is used without
> bounds check.

Hmm. I'm not sure what array you're talking about (the newpath/newbase
ones? We do protect against PATH_MAX, it's just that we protect against it
in the "previous iteration").

The bigger issue, though, is that I spent half a day looking more at this
series last Thursday, and I've got some improvements, but getting "all the
way" turns out to be really quite painful.

Why? We have a _lot_ of code that does "lstat()" on pathnames, and it all
basically uses the internal git representation of the pathname. In
particular, we do this a lot for index lookups, but it's true in other
cases too (example: things like tree merging, where we check whether a
file exists in the working tree).

To test this all out, I actually fleshed out the patches to the point
where I could do

    [core]
        PathEncoding = Latin1

and actually have the working tree use Latin1 encoding, and convert
internally in git to UTF-8, and have a working "git add ."

However, "git add ." was just about the only thing that I made do the
right thing. Even doing a simple "git diff" afterwards would then show the
file as deleted, because the UTF-8 version of the file (that the index
contained) didn't exist in the filesystem. I fixed that with a hack, but
it basically turns out to be pretty damn ugly, and there's a _lot_ of
those places.

So, the question is, "What now?"

There's a few alternatives:

 (a) don't do any of this crap at all. What git does right now works
     fairly well for most people.
     Instead, perhaps worry about just the crazy case-insensitive
     filesystems, which are a totally separate issue.

     End result: git will always have problems with the crazy NFD format
     that OS X uses. Mixing git archives across OS X and other saner
     operating systems (and in this context, Windows really does count as
     "saner" - it really is OS X that is braindamaged!) will be painful
     if you have odd characters in your working tree.

     This is the simplest approach, of course. The case-insensitivity is
     still not trivial, but we could work on it, and it really is a
     different problem (and has none of the "if you look the file up with
     a converted name, you cannot see it" issues that the Latin1<->UTF-8
     example had).

 (b) Forget about the general case (like Latin1) that needs two-way
     conversion. Just worry about OS X being crazy, and do the NFD->NFC
     translation, which only needs to be done one way (because OS X will
     still accept and recognize NFC characters, so the "converted" path
     is still seen as valid by 'lstat()' and friends).

     This is very much just a special case of handling filesystems that
     are UTF-8, but are confused about what "equivalent" and "identical"
     means, and where the filesystem designer was a moron on some
     seriously crazy drugs, and thought that equivalence means identity,
     and thought that NFD is a sane form to expose.

     This is a much simpler case than the general approach. I don't have
     OS X to test with, though, and so far it hasn't appeared that any
     OS X people really care enough to actually implement it. So I can
     fix up my series to a certain point, but will never be able to
     really do the final testing and tuning. At least with the full
     "treat filesystem as Latin1 encoding", I could _test_ it.

 (c) Try to bite the bullet.
     I can do this, but it really is going to be a _very_ invasive
     patch-series, and it will probably involve some nasty changes to the
     index format (for performance, we'll likely have to change the index
     to have _both_ the "git filename", and the "filesystem filename" in
     it).

     This was what I wanted to do, and it's what you'd need to do if you
     do things like Latin1 filesystem trees or ones where pathnames are
     done with shift-JIS encoding or if we want to actually use the
     (crazy) native Windows UCS filesystem accessors or whatever.

But I have to admit that after looking at the pain, I'm not at all
convinced it's worth it. Do we ever want to say "git supports filesystems
with shift-JIS encoding"? Do people really care deeply enough about
non-UTF-8 filesystems that they'd be willing to live with a _lot_ of
pretty nasty complexity, and some real performance overhead?

I have to say, even with plain UTF-8, git isn't really a pleasure to use.
While I did my Latin1 test, I used filenames like "åäö" (the three extra
Finnish/Swedish characters), and if you do this

    mkdir test-repo
    cd test-repo
    git init
    echo testfile > åäö
    git add .
    git ls-files

the end result is not actually really usable. We quote it to a binary
mess, rather than showing "åäö". Our pathname quoting is trying to be
safe, which is good, but it does mean that right now, odd characters
aren't very friendly even _if_ you are using a sane filesystem, and all
plain NFC UTF-8.

So right now, my personal opinion is:

 - let's just face the fact that the only sane filename representation is
   NFC UTF-8. Show filenames as UTF-8 when possible, rather than quoting
   them.

 - Do case (b) above: add support for converting NFD -> NFC at readdir()
   time, so that OS X people can use UTF-8 sanely.

 - add a "binary encoding" mode to filesystems that actually use Latin1,
   just so that if people use Latin1 or Shift-JIS filesystem encodings,
   we promise that we'll never munge those kinds of names.
 - Maybe we'd make the "binary encoding" (which is effectively existing
   git behavior) be the default on non-OSX platforms.

but that's just my gut feel from trying to weigh the costs of trying to do
something more involved against the costs of OS X support and just letting
crazy encodings exist in their own little worlds. So a development group
that uses Shift-JIS (or Latin1) would be able to work internally with git
that way, but would not be able to sanely work with the world at large
that uses UTF-8.

		Linus
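
Illustrative note on options (b) and (c) above: a minimal Python sketch, not git code (git itself is C), of why the Latin1 case needs two-way conversion while the NFD case only needs one-way. The helper names fs_to_git and git_to_fs and the on_disk set standing in for the working tree are hypothetical, made up for this example.

```python
import unicodedata

def fs_to_git(raw: bytes, fs_encoding: str) -> bytes:
    """Hypothetical readdir()-time step: filesystem bytes -> NFC UTF-8."""
    text = raw.decode(fs_encoding)
    return unicodedata.normalize("NFC", text).encode("utf-8")

def git_to_fs(name: bytes, fs_encoding: str) -> bytes:
    """Hypothetical reverse step needed before lstat() on a non-UTF-8 tree."""
    return name.decode("utf-8").encode(fs_encoding)

nfc = "\u00e5\u00e4\u00f6"               # the three characters, precomposed
nfd = unicodedata.normalize("NFD", nfc)  # base letters + combining marks

# Latin1 working tree: the index holds UTF-8, the disk holds Latin1 bytes.
on_disk = {nfc.encode("latin-1")}        # stand-in for the working tree
git_name = fs_to_git(nfc.encode("latin-1"), "latin-1")

# Looking the file up by its git name misses it (the "git diff shows it as
# deleted" symptom); only the converted-back name matches.
found_by_git_name = git_name in on_disk
found_after_reverse = git_to_fs(git_name, "latin-1") in on_disk

# NFD tree: one-way NFD -> NFC is enough to get the canonical git name.
# That the filesystem then also accepts the NFC name in lstat() is the
# claim made in (b) above; this sketch does not test it.
nfd_git_name = fs_to_git(nfd.encode("utf-8"), "utf-8")
```

Here found_by_git_name comes out False and found_after_reverse comes out True, which is the reason every lstat() caller would need the reverse mapping in case (c).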