On Wed, 13 May 2009, Linus Torvalds wrote: > > utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But > quite frankly, the index is only part of it, and probably not the worst > part. > > The real pain of filename handling is all the "read tree recursively with > readdir()" issues. Along with just an absolute sh*t-load of issues about > what to do when people ended up using different versions of the "same" > name in different branches. Btw, if people care mainly just about OS X, and don't worry so much about case, but about the idiotic and insane OS X behavior of turning UTF-8 filenames into that crazy NFD format, here's a simple patch that may be useful for that. There _will_ certainly be other places, but this handles the one big case of "read_directory_recursive()", and can turn NFD into the sane NFC format. Since OS X will then accept NFC (and internally turn it back to NFD) when you pass them as filenames, that means that converting the other way is not necessary. NOTE NOTE NOTE! This really just handles one case, and is not enough for any kind of general case. For example, it does NOT handle the case where you do git add filename_with_åäö explicitly, because if the "filename_with_åäö" is done using NFD (tab-completion etc), now git won't _match_ it with the filename it reads using readdir() any more (which got converted to NFC), so at a minimum we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. See "get_pathspec()" in setup.c for that latter case. But with that, and this crazy thing, OS X users might be already a lot better off. Totally untested, of course. Oh, and somebody needs to fill in that convert_name_from_nfd_to_nfc() implementation. It's designed so that if it notices that the string is just plain US-ASCII, it can return 0 and no extra work is done. That, in turn, can easily be done by some simple and efficient pre-processign that checks that there are no high bits set (on a 64-bit platform, do it 8 characters at a time with a "& 0x8080808080808080"), so that the common case doesn't need to have barely any overhead at all. Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do the actual normalization if you find characters with the high bit set. And since I know that the OS X filesystems are so buggy as to not even do that whole NFD thing right, there is probably some OS-X specific "use this for filesystem names" conversion function. Hmm. Anybody want to take this on? It really shouldn't be too complex to get it working for the common case on just OS X. It's really the case sensitivity that is the biggest problem, if you ignore that for now, the problem space is _much_ smaller. In other words, I think we can reasonably easily support a subset of _common_ issues with some trivial patches like this. But getting it right in _all_ the cases is going to be much more work (there are lots of other uses of "readdir()" too, this one just happens to be one of the more central ones). Of course, it probably makes sense to have a whole "git_readdir()" that does this thing in general. That "create_full_path()" thing makes sense regardless, though, in that it also simplifies a lot of "baselen+len" usage in just "len". Linus --- dir.c | 40 ++++++++++++++++++++++++++++++++-------- 1 files changed, 32 insertions(+), 8 deletions(-) diff --git a/dir.c b/dir.c index 6aae09a..4cbfc24 100644 --- a/dir.c +++ b/dir.c @@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path) } /* + * Take the readdir output, in (d_name,len), and append it to + * our base name in (fullname,baselen) with any required + * readdir fs->internal translation. + * + * Put the result in 'fullname', and return the final length. + * + * Right now we have no translation, and just do a memcpy() + * (the +1 is to copy the final NUL character too). + */ +static int create_full_path(char *fullname, int baselen, const char *d_name, int len) +{ +#ifdef OS_X_IS_SOME_CRAZY_SHxAT + char temp[256], nlen; + nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp)); + if (nlen) { + len = nlen; + d_name = temp; + } +#endif + memcpy(fullname + baselen, d_name, len + 1); + return baselen + len; +} + +/* * Read a directory tree. We currently ignore anything but * directories, regular files and symlinks. That's because git * doesn't handle them at all yet. Maybe that will change some @@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co /* Ignore overly long pathnames! */ if (len + baselen + 8 > sizeof(fullname)) continue; - memcpy(fullname + baselen, de->d_name, len+1); - if (simplify_away(fullname, baselen + len, simplify)) + len = create_full_path(fullname, baselen, de->d_name, len); + if (simplify_away(fullname, len, simplify)) continue; dtype = DTYPE(de); exclude = excluded(dir, fullname, &dtype); if (exclude && (dir->flags & DIR_COLLECT_IGNORED) - && in_pathspec(fullname, baselen + len, simplify)) - dir_add_ignored(dir, fullname, baselen + len); + && in_pathspec(fullname, len, simplify)) + dir_add_ignored(dir, fullname, len); /* * Excluded? If we don't explicitly want to show @@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co default: continue; case DT_DIR: - memcpy(fullname + baselen + len, "/", 2); + memcpy(fullname + len, "/", 2); len++; - switch (treat_directory(dir, fullname, baselen + len, simplify)) { + switch (treat_directory(dir, fullname, len, simplify)) { case show_directory: if (exclude != !!(dir->flags & DIR_SHOW_IGNORED)) @@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co break; case recurse_into_directory: contents += read_directory_recursive(dir, - fullname, fullname, baselen + len, 0, simplify); + fullname, fullname, len, 0, simplify); continue; case ignore_directory: continue; @@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co if (check_only) goto exit_early; else - dir_add_name(dir, fullname, baselen + len); + dir_add_name(dir, fullname, len); } exit_early: closedir(fdir); -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html