Re: [RFC] Submodules in GIT

On Sat, 2 Dec 2006, Linus Torvalds wrote:
> 
> And watch the memory usage.

Btw, just in case you don't understand _why_ this is true: because a git 
repository quite fundamentally has no "backlinks" at any stage at all, we 
don't know - and fundamentally _cannot_ know - whether we're going to see 
the same object again in the future.

So operations like "git-rev-list --objects" (or, these days, more commonly 
anything that just does the equivalent of that internally using the 
library interfaces - i.e. "git pack-objects" and friends) VERY 
FUNDAMENTALLY have to hold on to the object flags for the whole lifetime 
of the operation.
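
Just to make the mechanism concrete, here's a minimal sketch (not the 
actual git source, just the shape of the idea): each in-core object 
carries a flags word, and a whole-repo walk sets a SEEN bit that must 
stay set until the walk finishes, because the same object can be reached 
again through any number of later trees and commits:

	#define SEEN 0x01

	struct object {
		unsigned flags;
		unsigned char sha1[20];	/* the object name */
	};

	/* Returns nonzero the first time an object is reached; the
	 * caller then emits it and recurses into what it points to.
	 * The flag can never be cleared mid-walk: clearing it would
	 * mean emitting the object again on the next path to it. */
	static int mark_object(struct object *obj)
	{
		if (obj->flags & SEEN)
			return 0;
		obj->flags |= SEEN;
		return 1;
	}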

And you should realize that this is really really fundamental. You can't 
fix it with "smarter memory management". You can't fix it with "garbage 
collection". This is _not_ a result of the fact that we use C and malloc, 
and we don't free those objects, like some people sometimes seem to 
believe.

So garbage collection will never help this kind of situation. It flows 
_directly_ from the fact that our objects are immutable: because they are 
immutable, they don't have any backpointers, because we cannot (and must 
not) add backpointers to an old existing object when a new object is 
created that points to it.
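
You can see the forward-only pointers directly in the object format 
itself: a commit names its tree and its parents, but nothing in those 
older objects ever names the commit (the SHA1s and identities below are 
just placeholders):

	$ git cat-file commit HEAD
	tree 9bd8a1f2c3d4e5f60718293a4b5c6d7e8f901234
	parent 4b825dc642cb6eb9a060e54bf8d69288fbee4904
	author Some Author <author@example.com> 1165000000 -0800
	committer Some Author <author@example.com> 1165000000 -0800

	commit message goes here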

So this really isn't a memory management issue. You could somewhat work 
around it by adding a "caching layer" on top of git, and allowing that 
caching layer to modify its cache of old objects (so that they can 
contain back-pointers), but for 99% of all users that would actually make 
performance MUCH WORSE, and it would also raise serious coherency issues 
(one of the things immutable objects buy you is that there are basically 
never any race conditions, while a "caching layer" like this would have 
some serious serialization issues).

So: the very fundamental nature and choices that were made in git also 
mean that when you have something like git-pack-objects that wants to 
walk the whole repo, you will end up with something that remembers EVERY 
SINGLE OBJECT it walked.

And while I've worked very hard to make the memory footprint of individual 
objects as small as possible - which means that this all works fine even 
for fairly large databases (especially since very few operations actually 
do this "traverse the whole friggin tree" thing) - it does mean that 
there's a very fundamental limit to scalability. You can't just make a 
whole repository a hundred times bigger - because the operations that 
traverse the whole thing will require a hundred times more memory!

Now, in "real" projects, this is not a problem. I can pretty much 
_guarantee_ that memory sizes and hardware will grow faster than projects 
grow. I'm not AT ALL worried about the fact that in ten years, the linux 
kernel repository will likely be two or three times the size it is now. 
Because I'm absolutely convinced that in ten years, the machines we have 
now will be obsolete.

So on any "individual project" basis, the fact that memory requirements 
scale roughly as O(n) in the total repository size is simply not a 
problem. In fact, O(n) is pretty damn good, especially since the constant 
is pretty small (basically 28 bytes per object - and 20 of those bytes 
are the SHA1 that you simply cannot avoid).
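
To put a number on it, a back-of-the-envelope calculation with that 
28-byte constant: a full walk of a repository with, say, five million 
objects keeps on the order of

	5,000,000 objects x 28 bytes/object ~= 140 MB

of object structs live at the same time - and all of it has to stay live 
until the very last object has been walked.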

But it does mean that supermodules really should NOT be so seamless that 
doing a "git clone" on a supermodule does one _large_ clone. Because it's 
simply going to be better to:

 - when you clone the supermodule, track the commits you need on all 
   submodules (this _may_ be a reason in itself for the "link" object, 
   just so that you can traverse the supermodule object dependencies and 
   know what subobject you are looking at even _without_ having to look 
   at the path you got there from - see the example after this list)

 - clone submodules one-by-one, using the list of objects you gathered.
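
To make the "link" idea concrete, one possible shape is a new tree entry 
type that records the submodule commit directly (the mode, type and SHA1s 
here are placeholders, not a committed design):

	$ git ls-tree HEAD
	100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6	Makefile
	160000 commit 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b	lib/frotz

A walk of the supermodule would then see exactly which submodule commits 
it needs without ever reading the submodule's own objects.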

Maybe there are other solutions, but quite frankly, I doubt it. Yes, 
you'll end up "traversing" exactly as many objects either way, but the 
"clone subobjects one by one" approach is going to be a _hell_ of a lot 
more memory-efficient, and quite frankly, "memory usage" == "performance" 
under many loads (notably, any load that uses too much memory will _suck_ 
performance-wise, either because of swapping or simply because it will 
throw out caches that "many small invocations" would not have thrown out).

So I guarantee that it's going to be better to do five clones of five 
small repositories over one clone of one big one. If only because you need 
less memory to do the five smaller clones.
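
The arithmetic is simple. With the total of n objects split evenly across 
five submodules:

	one big clone:          peak ~= 28 * n bytes of object structs
	five sequential clones: peak ~= 28 * n/5 bytes

Same total work either way, but a fifth of the peak memory - and peak 
memory is what decides whether you swap or blow the caches.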

			Linus
