Re: If I were redoing git from scratch...

Linus Torvalds <torvalds@xxxxxxxx> · Sat, 4 Nov 2006 08:44:02 -0800 (PST)

On Sat, 4 Nov 2006, Junio C Hamano wrote:
> 
> The biggest one is that we use too many static (worse, function
> scope static) variables that live for the life of the process,
> which makes many things very nice and easy ("run-once and let
> exit clean up the mess" mentality), but because of this it
> becomes awkward to do certain things.  Examples are:
> 
>  - Multiple invocations of merge-bases (needs clearing the
>    marks left on commit objects by earlier traversal),

Well, quite frankly, I dare anybody to do it differently, yet have good 
performance with millions of objects.

The fact is, I don't think it _can_ be done. I would seriously suggest 
re-visiting this in five years, just because CPU's and memory will by then 
hopefully have gotten an order of magnitude faster/bigger.

The thing is, the object database when we read it in really needs to be 
pretty compact-sized, and we need to remember objects we've seen earlier 
(exactly _because_ we depend on the flags). So there's exactly two 
alternatives:
 - global life-time allocations of objects like we do now
 - magic memory management with unknown lifetimes and keeping track of all 
   pointers.

And I'd like to point out that the memory management right now is simply 
not realistic:

 - it's too damn hard. A simple garbage collector based on the approach we 
   have now would simply not be able to do anything, since all objects are 
   _by_definition_ reachable from the hash chains, so there's nothing to 
   collect. The lifetime of an object fundamentally _is_ the whole process 
   lifetime, exactly because we expect the objects (and the object flags 
   in particular) to be meaningful.

 - pretty much all garbage collection schemes tend to have a memory 
   footprint that is about twice what a static footprint is under any 
   normal load. Think about what we already do with "git pack-objects" for 
   something like the mozilla repository: I worked quite a lot on getting 
   the memory footprint down, and it's _still_ several hundred MB. 

In other words, I can pretty much guarantee that some kind of "smarter" 
memory management would be a huge step backwards. Yes, we now have to do 
some things explicitly, but exactly because we do them explicitly we can 
_afford_ to have the stupid and simple and VERY EFFICIENT memory 
management ("lack of memory management") that we have now.

The memory use of git had an very real correlation with performance when I 
was doing the memory shrinking a few months back (back in June). I realize 
that it's perhaps awkward, but I would really want people to realize that 
it's a huge performance issue. It was a clear performance issue for me 
(and I use machines with 2GB of RAM, so I was never swapping), it would be 
an even bigger one for anybody where the size meant that you needed to 
start doing paging.

So I would seriously ask you not to even consider changing the object 
model. Maybe add a few more helper routines to clear all object flags or 
something, but the "every object is global and will never be de-allocated" 
is really a major deal.

Five years from now, or for somebody who re-implements git in Java (where 
performance isn't going to be the major issue anyway, and you probably do 
"small" things like "commit" and "diff", and never do full-database things 
like "git repack"), _then_ you can happily look at having something 
fancier. Right now, it's too easy to just look at cumbersome interfaces, 
and forget about the fact that those interfaces is sometimes what allows 
us to practically do some things in the first place.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html