On Sat, 21 Jul 2007, David Kastrup wrote: > Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes: > > > Of course, it seldom matters, but basically, you should test a directory > > structure that has the files > > > > dir.c > > dir/test > > > > in it, and the "dir" directory should always sort _after_ "dir.c". > > > > And yes, having the index entry with a '/' at the end would handle > > that automatically. > > You completely lost me here. I guess I'll be able to pick this up > only after investing considerable more time into the data structures. So the basic issue is that not only does git obviously think that only content matters, but it describes it with a single SHA1. That's not an issue at all for a single file, but if you want to describe *multiple* files with a single SHA1 (which git obviously very much wants to do), the way you generate the SHA1 matters a lot. In particular, the order. So git is very very strict about the ordering of tree structures. A tree structure is not just a random list of <ASCII mode> + <space> + <filename> + <NUL> + <SHA1> it's very much an _ordered_ list of those things, because we want the SHA1 of the tree to be well-specified by the contents, and that means that the contents of a tree object has have absolutely _zero_ ambiguity. This means, for example, that git is very fundamentally case sensitive. There's no sane way *not* to be, because if you're case insensitive in any way at all, you'll end up having two trees that are "the same", but end up having different SHA1's. It also means that git objects have absolutely zero "localization". There is no locale at all, and there very fundamnetally *must*not* be. Again, for the same reason: if you can describe the same filename with two different encodings, you'd have two different SHA1's for the same content. So git filenames are very much a "stream of bytes", not anything else. And they need to sort 100% reliably, always the same way, and never with any localized meaning. And, partly because it seemed most natural, and partly for historical reasons, the way git sorts filenames is by sorting by *pathname*. So if you have three files named a.c a/c abc then they sort in that exact order, and no other! They sort as a "memcmp" in the full pathname, and that's really nice when you see whole collections of files, and you know the list is globally sorted. So that "global pathname sorting" has nice properties, and it seems "obvious", but it means that because git actually *encodes* those three files hierarchically as two different trees (because there's a subdirectory there), the tree objects themselves sort a bit oddly. The tree obejcts themselves will look like top-level tree: 100644 a.c -> blob1 040000 a -> tree2 100644 abc -> blob3 sub-tree: 100644 c -> blob2 and notice how the *tree* is not sorted alphabetically at all. It has a subtly different sort, where the entry "a" sorts *after* the entry "a.c", because we know that it's a tree entry, and thus will (in the *global* order) sort as if it had a "/" at the end! Traditionally, when we have the index, the index sorting has been very simple: you just sort the names as memcmp() would sort them. But note how that changes, if "a" is an empty directory. Now the index needs to sort as file a.c dir a file abc because when we create the tree entry, it needs to be sorted the same way all tree entries are always sorted - as if "a" had a slash at the end! [ Yeah, yeah, we could make a special case and just say "the empty tree sorts differently", but that actually results in huge problems when doing a "diff" between two trees: our diff machinery very much depends on the fact that the index and the trees always sort the same way, and if we sorted the "a" entry (when it is an empty directory) differently from the "a" entry (when it has entries in it), that would just be insane and cause no end of trouble for comparing two trees - one with an empty directory and one with content added to that directory. So the sorting is doubly important: it's what makes "one content" always have the same SHA1, but it is also much easier and efficient to compare directories when we know they are sorted the same way. ] In other words, introducing tree entries in the index ended up also introducing all the issues that we already had with the tree objects since they got split up hierarchically, but that the code didn't use to have to care about. The easiest way to solve this really does seem to be to add the rule that the index entry for an empty directory has to have the "/" at the end of the name - then the "sort mindlessly by name" will just continue to work. But that was what I said was broken: my patches I sent out didn't actually do that. It's *probably* just a few lines of code, and it actually would result in some nice changes ("git ls-files" would show a '/' at the end of an empty directory entry, for example), so this is not a big deal, but it's an example of how subtly different a directory is from a file when it comes to git. Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html