Re: Submodule object store

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 27 Mar 2007 11:41:11 -0700 (PDT)

On Tue, 27 Mar 2007, Uwe Kleine-König wrote:
> 
> 	embeddedproject$ git ls-tree HEAD | grep linux
> 	040000 commit 0123456789abcde0... linux-2.6
> 
> (or however you save submodules).  Then you might have to duplicate the
> objects of linux-2.6, because they are part of both histories.

No they are not. Unless you do it wrong.

The *only* object that is part of the superproject would be the tree that 
*contains* that entry itself.

We should *never* automatically follow such an entry down, *exactly* 
because that doesn't scale. So to actually follow that entry for something 
like a recursive diff, you'd literally "cd into linux, and start 'git diff' 
from commit 0123456.."

In other words, the subproject would be its own project, and the 
superproject never sees it as "part of itself". I really think, for 
example, that the "git diff" family of programs (diff-index, diff-tree, 
diff-files) and things like "git ls-tree" should literally:

 - have a mode where they don't even recurse into subprojects, and I 
   personally think that it could/should be the default!

 - when they recurse, they should literally (at least to begin with) do 
   that kind of "fork() ; if (child) { chdir(subproject); execve(myself) }" 

The latter is really to make sure that *even*by*mistake* we don't screw 
things up and tie the sub/superproject together too tightly. 
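To make the "fork, chdir, exec" idea concrete, here is a minimal sketch in 
C. It is only an illustration, not how git does or should implement it: 
the helper name run_in_subproject() is made up, and it uses execvp() (PATH 
lookup) rather than a bare execve() purely for convenience.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] in a separate process whose working
 * directory is `subproject`. Returns the child's exit status, or -1 if
 * the child could not be started or did not exit normally.
 */
int run_in_subproject(const char *subproject, char *const argv[])
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;		/* fork failed */

	if (pid == 0) {
		/* Child: enter the subproject, then become the command. */
		if (chdir(subproject) != 0) {
			perror(subproject);
			_exit(127);
		}
		execvp(argv[0], argv);
		perror(argv[0]);	/* reached only if exec failed */
		_exit(127);
	}

	/* Parent: never looks inside the subproject, it only waits. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass something like { "git", "diff", NULL } as argv and 
"linux-2.6" as the subproject. The point of the separate process is that 
the parent shares no in-memory state with the run inside the subproject, 
so the two cannot get tangled up by accident.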

I'm serious. I really think that the first version (which ends up being 
the one that sets semantics) should be very careful here, so that 
subprojects never get mixed up with the superproject.

And I'm also serious about the "don't recurse into subproject by default 
at all". If I'm at the superproject, and I maintain the superproject, I 
think the state of the subprojects themselves is a totally separate 
issue. It's quite a valid thing to do to maintain the build 
infrastructure, and if I'm the maintainer of that, and I do "git diff", I 
sure as hell don't want to wait for git to do "git diff" on the 
subprojects when there are 5000 of them!

Sure, "git diff" is fast (on the kernel, it takes me 0.069s on a clean 
tree), but 

 - multiply that 0.069s by 5000 and it's not so fast any more

 - when you have a thousand subprojects, it's quite possible (even likely) 
   that all your directories won't fit in the cache any more, and suddenly 
   even a single "git diff" takes several seconds.

Really! Try this on the Linux tree (that "drop_caches" thing needs root 
privileges):

	echo 3 > /proc/sys/vm/drop_caches
	git diff

and see it take something like 5 seconds. Now, imagine that you have a 
hundred subprojects, and they're big enough that the caches are *never* 
warm.
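Putting rough numbers on the two cases above, as a trivial sketch: the 
helper total_seconds() is made up, and it assumes the runs happen one 
after another and that every subproject costs about as much as the kernel 
tree does.

```c
#include <stdio.h>

/*
 * Hypothetical helper: total wall-clock seconds when each of `count`
 * subprojects costs `per_run` seconds and they run sequentially.
 */
double total_seconds(double per_run, int count)
{
	return per_run * count;
}
```

With the figures above, total_seconds(0.069, 5000) comes to about 345 
seconds (almost six minutes) even with warm caches, and 
total_seconds(5.0, 100) comes to 500 seconds when the caches are cold.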

People sometimes don't seem to understand what "scalability" really means. 
Scalability means that something that is so fast that you don't even 
*think* about it will become a major bottleneck when you do it a thousand 
times, and the working set has grown so big that it totally blows out 
several levels of caches (both CPU caches and disk caches).

		Linus