Re: [RFC PATCH] Re: Empty directories...

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 21 Jul 2007 08:53:39 -0700 (PDT)

On Sat, 21 Jul 2007, David Kastrup wrote:

> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
> 
> > Of course, it seldom matters, but basically, you should test a directory 
> > structure that has the files
> >
> > 	dir.c
> > 	dir/test
> >
> > in it, and the "dir" directory should always sort _after_ "dir.c".
> >
> > And yes, having the index entry with a '/' at the end would handle
> > that automatically.
> 
> You completely lost me here.  I guess I'll be able to pick this up
> only after investing considerable more time into the data structures.

So the basic issue is that not only does git obviously think that only 
content matters, but it describes it with a single SHA1. 

That's not an issue at all for a single file, but if you want to describe 
*multiple* files with a single SHA1 (which git obviously very much wants 
to do), the way you generate the SHA1 matters a lot.

In particular, the order.

So git is very very strict about the ordering of tree structures. A tree 
structure is not just a random list of

	<ASCII mode> + <space> + <filename> + <NUL> + <SHA1>

it's very much an _ordered_ list of those things, because we want the SHA1 
of the tree to be well-specified by the contents, and that means that the 
contents of a tree object has have absolutely _zero_ ambiguity.

This means, for example, that git is very fundamentally case sensitive. 
There's no sane way *not* to be, because if you're case insensitive in any 
way at all, you'll end up having two trees that are "the same", but end up 
having different SHA1's.

It also means that git objects have absolutely zero "localization". There 
is no locale at all, and there very fundamnetally *must*not* be. Again, 
for the same reason: if you can describe the same filename with two 
different encodings, you'd have two different SHA1's for the same content.

So git filenames are very much a "stream of bytes", not anything else. And 
they need to sort 100% reliably, always the same way, and never with any 
localized meaning.

And, partly because it seemed most natural, and partly for historical 
reasons, the way git sorts filenames is by sorting by *pathname*. So if 
you have three files named

	a.c
	a/c
	abc

then they sort in that exact order, and no other! They sort as a "memcmp" 
in the full pathname, and that's really nice when you see whole 
collections of files, and you know the list is globally sorted.

So that "global pathname sorting" has nice properties, and it seems 
"obvious", but it means that because git actually *encodes* those three 
files hierarchically as two different trees (because there's a 
subdirectory there), the tree objects themselves sort a bit oddly. The 
tree obejcts themselves will look like

 top-level tree:
	100644 a.c -> blob1
	040000 a   -> tree2
	100644 abc -> blob3

 sub-tree:
	100644 c    -> blob2

and notice how the *tree* is not sorted alphabetically at all. It has a 
subtly different sort, where the entry "a" sorts *after* the entry "a.c", 
because we know that it's a tree entry, and thus will (in the *global* 
order) sort as if it had a "/" at the end!

Traditionally, when we have the index, the index sorting has been very 
simple: you just sort the names as memcmp() would sort them. But note how 
that changes, if "a" is an empty directory. Now the index needs to sort as

	file a.c
	dir  a
	file abc

because when we create the tree entry, it needs to be sorted the same way 
all tree entries are always sorted - as if "a" had a slash at the end!

[ Yeah, yeah, we could make a special case and just say "the empty tree 
  sorts differently", but that actually results in huge problems when 
  doing a "diff" between two trees: our diff machinery very much depends 
  on the fact that the index and the trees always sort the same way, and 
  if we sorted the "a" entry (when it is an empty directory) differently 
  from the "a" entry (when it has entries in it), that would just be 
  insane and cause no end of trouble for comparing two trees - one with an 
  empty directory and one with content added to that directory.

  So the sorting is doubly important: it's what makes "one content" always 
  have the same SHA1, but it is also much easier and efficient to compare 
  directories when we know they are sorted the same way. ]

In other words, introducing tree entries in the index ended up also 
introducing all the issues that we already had with the tree objects since 
they got split up hierarchically, but that the code didn't use to have to 
care about.

The easiest way to solve this really does seem to be to add the rule that 
the index entry for an empty directory has to have the "/" at the end of 
the name - then the "sort mindlessly by name" will just continue to work.

But that was what I said was broken: my patches I sent out didn't actually 
do that.

It's *probably* just a few lines of code, and it actually would result in 
some nice changes ("git ls-files" would show a '/' at the end of an empty 
directory entry, for example), so this is not a big deal, but it's an 
example of how subtly different a directory is from a file when it comes 
to git.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html