One of Linus's recent patches introduces an index hashtable so that we
can later hash "equivalent" names into the same bucket and compare
names other than byte-by-byte.

Before going further, I needed to formalize what we are trying to
achieve. I learned a few things from the long flamewar thread, but
it is very inefficient to go back to the thread to pick out only the
useful pieces. The whole flamewar simply did not fit a small Panda
brain. That was the reason for this write-up.

Design constraints
------------------

In the following, I'll use two names $A and $B as an example. They
are a pair of names that are considered equivalent in some contexts,
such as:

    A=xt_connmark.c
    B=xt_CONNMARK.c

(1) Some filesystems prevent you from having these two (confusing)
    paths in a directory at the same time. Some do not implement
    this confusion prevention, and allow both names to exist at the
    same time.

    Let's call the former "case insensitive", and the latter "case
    sensitive".

(2) readdir(3) on some "case insensitive" filesystems returns $A
    after a successful creat(2) of $B. Others remember which one of
    the two "equivalent" names was used in creat(2).

    Let's call the former "case folding", and the latter "case
    preserving".

    We assume that, if a case folding filesystem returns $A from
    readdir(3) after allowing creat(2) of $B, then open(2) or
    lstat(2) of either $A or $B will succeed.

(3) Among the "case folding" ones, some filesystems fold the pathname
    to a form that is less interoperable with other systems, and/or
    a form that is likely to be different from what the end-user
    usually enters. Such filesystems are "inconveniently case
    folding".

The last one is not quite apparent with the "xt_connmark.c" example,
but if you replace $A and $B in the above description with:

    A=Ma"rchen
    B=Märchen

it would hopefully become more clear.

For example, vfat is generally "case preserving".
In that long flamewar thread, I think we learned that HFS+ is in
general "inconveniently case folding" with respect to Unicode: it
always folds to $A, while keyboard/IM input is more likely to come as
$B, which happens to be the form that is more interoperable with
other systems.

Issues with case insensitive filesystems
----------------------------------------

At the data structure level, a pathname to git is a sequence of bytes
terminated with NUL. This will _not_ change.

By the way, at the data structure level, a tree entry in git can
represent a blob that is a symbolic link. A tree entry in git can
also represent a blob that is a regular file, and in that case, it
can record whether the file is executable or not. These will also
not change.

Now, let's think about how we allow use of git on a filesystem that
is incapable of symbolic links, and/or a filesystem that does not
have a trustworthy executable bit.

We do not say "Symlinks are evil and not supported everywhere, so
let's introduce a project configuration to disallow addition of
symlinks". We do not say that about the executable bit, either.
Instead, we have fallback methods that allow manipulating symlinks
and the executable bit on a filesystem that is incapable of handling
them natively.

We should be able to do the same for this "case sensitivity" issue.

A tree that has xt_connmark.c and xt_CONNMARK.c at the same time
cannot be checked out on a case insensitive filesystem. The
filesystem is simply incapable of it (please just calmly rephrase it
in your head as "does not allow such confusing craziness" instead of
starting another flamewar, if you feel the expression "incapable of"
insults your favorite filesystem).

That may mean the project should avoid such equivalent names in its
trees (and having a project wide configuration could be a technical
means to help enforce that policy), but it does not mean the core
level of git should prevent them from being created on such systems.
It just means that there should be a way, which could (and sometimes
has to) be different from the "natural" way, to manipulate such tree
entries even on a case insensitive filesystem.

For example, suppose I find that the RelNotes symlink incorrectly
points at Documentation/RelNotes-1.5.44.txt and want to fix it and
push it out immediately, but I am on the road and the only
environment I can borrow is a git installation on a filesystem that
is symlink-challenged. I can still do the fix. On such a
filesystem, a symlink is checked out as a regular file but is still
marked as a symlink in the index. The only thing I need to do is to
edit the file (making sure not to add an extra LF at the end) and add
it to the index. That's certainly different from the "natural" way
to do that on a filesystem with symlinks, which is "ln -fs
Documentation/RelNotes-1.5.4.txt RelNotes", but the point is that we
make it possible.

The same thing should apply to two files that cannot be checked out
at the same time on case insensitive filesystems. Perhaps we could
have something like:

    $ git show :xt_CONNMARK.c >xt_connmark-1.c
    $ edit xt_connmark-1.c
    $ git add --as xt_CONNMARK.c xt_connmark-1.c

Issues with case folding filesystems
------------------------------------

In addition to the above, case folding filesystems have an issue even
when there are no "confusing" names in the tree. The project may
want to have "Märchen" (but not "Ma"rchen"), but a checkout (which is
creat(2) of "Märchen" -- because that is the byte sequence recorded
in tree objects and the index) will result in "Ma"rchen" and no
"Märchen" (hence readdir(3) returns "Ma"rchen").

Linus's patch to use a hashtable that links "equivalent" names
together is a step in the right direction to address this. The tree
(and the index) has name $B, we check out and the filesystem folds it
to $A.
When we get the name $A back from the filesystem (via readdir(3)), we
hash the name using a hash function that drops names $A and $B into
the same bucket, and compare that name $A with each entry in the
bucket using a comparison that considers $A and $B equivalent. If we
find one, then we keep the name $B we already have.

If it is a new file, we won't find any name that is equivalent to $A
in the index, and we use the name $A obtained from readdir(3).

BUT with a twist. If the filesystem is known to be inconveniently
case folding, we are better off registering $B instead of $A
(assuming we can convert from $A to $B).

One bad issue during development is that we cannot sanely emulate
case folding behaviour on non case-folding filesystems without
wrapping open(2), lstat(2), and friends, because of the assumption we
made above in (2) where we defined the term "case folding". This
means that the codepaths that deal with case folding filesystems are
inevitably harder to debug.

Tasks
-----

 - Identify which case folding filesystems need to be supported, and
   make sure somebody understands their folding logic;

 - For each supported case folding logic, these are needed:

   - a hash function that throws "equivalent" names in the same
     bucket, to be used in Linus's patch;

   - a compare function to determine equivalent names;

   - a convert function that takes a possibly inconvenient form of
     an equivalent name (i.e. $A above) as input and returns the
     more convenient form (i.e. $B above);

 - Identify places where we use names obtained from places other
   than the index and tree. From these places, we would need to
   call the convert function to (de)mangle the names before they hit
   the index.

   Because we may be getting driven by something like:

       $ find | xargs git-foo

   handling specially only the readdir(3) calls we make ourselves
   does not make much sense. Any path from the user is suspect.

 - Identify places where we look for a name in the index, and
   perform an equivalence comparison instead of the memcmp(3) we
   traditionally did.
   Linus's patch gives scaffolding for this.
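
To make the three per-filesystem helpers listed under "Tasks" a bit
more concrete, here is a minimal sketch in Python (not git's C, and
not Linus's patch). Every name in it (fold_key, name_hash, same_name,
to_convenient, Index) is made up for illustration. It also assumes
that "equivalent" means "equal after Unicode NFC normalization and
case folding", and that the convenient form is NFC; a real filesystem
may well use different rules.

```python
import unicodedata
import zlib

# Hypothetical sketch: none of these names exist in git.
# Assumption: two names are "equivalent" when they are equal after
# NFC normalization followed by Unicode case folding.


def fold_key(name):
    """Return a canonical key shared by all equivalent names."""
    return unicodedata.normalize("NFC", name).casefold()


def name_hash(name, nbuckets):
    """Hash function: equivalent names land in the same bucket."""
    return zlib.crc32(fold_key(name).encode("utf-8")) % nbuckets


def same_name(a, b):
    """Compare function: replaces the byte-by-byte comparison."""
    return fold_key(a) == fold_key(b)


def to_convenient(name):
    """Convert function: turn an inconvenient form ($A) into a more
    convenient one ($B). Assumed here to be NFC composition only;
    the original letter case cannot be recovered this way."""
    return unicodedata.normalize("NFC", name)


class Index:
    """Toy stand-in for the index: buckets of names keyed by hash."""

    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, name):
        return self.buckets[name_hash(name, len(self.buckets))]

    def add(self, name):
        self._bucket(name).append(name)

    def resolve(self, fs_name, inconvenient_folding=False):
        """Map a name obtained from the filesystem or the user to the
        name that should be used in the index."""
        for known in self._bucket(fs_name):
            if same_name(known, fs_name):
                # An equivalent name is already recorded: keep it.
                return known
        # New path: use it as given, unless the filesystem is known
        # to fold inconveniently, in which case convert first.
        if inconvenient_folding:
            return to_convenient(fs_name)
        return fs_name
```

With "Märchen" already added, resolve() of the decomposed spelling
returns the recorded "Märchen". A path with no equivalent in the
index comes back unchanged, or composed when inconvenient_folding is
set. Whether one key function can serve every supported filesystem
is exactly the open question in the first task.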