Re: git-index-pack really does suck..

Nicolas Pitre <nico@xxxxxxx> · Tue, 03 Apr 2007 17:21:12 -0400 (EDT)

On Tue, 3 Apr 2007, Linus Torvalds wrote:

> 
> 
> On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> > > 
> > > Yeah. What happens is that inside the repo, because we do all the 
> > > duplicate object checks (verifying that there are no evil hash collisions) 
> > > even after fixing the memory leak, we end up keeping *track* of all those 
> > > objects.
> > 
> > What do you mean?
> 
> Look at what we have to do to look up a SHA1 object.. We create all the 
> lookup infrastructure, we don't *just* read the object. The delta base 
> cache is the most obvious one. 

It is caped to 16MB, so we're far from the 200+ MB count.

> > I'm of the opinion that this patch is unnecessary.  It only helps in 
> > bogus workflows to start with, and it makes the default behavior unsafe 
> > (unsafe from a paranoid pov, but still).  And in the _normal_ workflow 
> > it should never trigger.
> 
> Actually, even in the normal workflow it will do all the extra unnecessary 
> work, if only because the lookup costs of *not* finding the entry.
> 
> Lookie here:
> 
>  - git index-pack of the *git* pack-file in the v2.6/linux directory (zero 
>    overlap of objects)
> 
>    With --paranoid:
> 
> 	2.75user 0.37system 0:03.13elapsed 99%CPU
> 	0major+5583minor pagefaults
> 
>    Without --paranoid:
> 
> 	2.55user 0.12system 0:02.68elapsed 99%CPU
> 	0major+2957minor pagefaults
> 
> See? That's the *normal* workflow. Zero objects found. 7% CPU overhead 
> from just the unnecessary work, and almost twice as much memory used. Just 
> from the index file lookup etc for a decent-sized project.

7% overhead over 2 second and a half of CPU which, _normally_, happens 
when cloning the whole thing over a network connection which, if you're 
lucky and have a 6mbps cable connection, will still be spread over 5 
minutes of real time.  And that is assuming that you're cloning a big 
project inside itself which wouldn't work anyway.  Otherwise a big clone 
wound run index-pack in an empty repo where the lookup of exinsting 
object is zero.  Remains git-fetch which should concern itself with much 
smaller packs pushing this overhead in the noise.

> Now, in the KDE situation, the *unnecessary* lookups will be about ten 
> times more expensive, both on memory and CPU, just because the repository 
> is about 20x the size. Even with no actual hits.

So?  When would you really perform such an operation in a meaningful 
way?

The memory usage worries me.  I still cannot explain nor justify it.  
But the CPU overhead is certainly not of any concern in _normal_ usage 
scenarios, is it?

If anything that might be a good test case for the newton-raphson pack 
lookup idea.

Nicolas
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html