On Wed, 13 Aug 2008, Shawn O. Pearce wrote:

> Nicolas Pitre <nico@xxxxxxx> wrote:
> > Well, we are talking about 50MB which is not that bad.
>
> I think we're closer to 100MB here due to the extra overheads
> I just alluded to above, and which weren't in your 104 byte
> per object figure.

Sure. That should still be workable on a machine with 256MB of RAM.

> > However there is a point where we should be realistic and just admit
> > that you need a sufficiently big machine if you have huge repositories
> > to deal with. Git should be fine serving pull requests with relatively
> > little memory usage, but anything else such as the initial repack simply
> > requires enough RAM to be effective.
>
> Yea. But it would also be nice to be able to just concat packs
> together. Especially if the repository in question is an open source
> one and everything published is already known to be in the wild,
> as say it is also available over dumb HTTP. Yea, I know people
> like the 'security feature' of the packer not including objects
> which aren't reachable.

It is not only that, even if that is a point I consider important. If
you end up with 10 packs, it is likely that a base object in each of
those packs could simply be a delta against a single common base
object, and therefore the amount of data to transfer might be up to 10
times higher than necessary (see the P.S. below for a way to measure
this).

> But how many times has Linus published something to his linux-2.6
> tree that he didn't mean to publish and had to rewind? I think
> that may be "never". Yet how many times per day does his tree get
> cloned from scratch?

That's not a good argument. Linus is a very disciplined git user,
probably more so than average. We should not use that example to paper
over technical issues.

> This is also true for many internal corporate repositories.
> Users probably have full read access to the object database anyway,
> and maybe even have direct write access to it. Doing the object
> enumeration there is pointless as a security measure.

It is still good for network bandwidth efficiency, as I mentioned.

> I'm too busy to write a pack concat implementation proposal, so
> I'll just shutup now. But it wouldn't be hard if someone wanted
> to improve at least the initial clone serving case.

A much better solution would consist of finding out just _why_ object
enumeration is so slow. This is indeed my biggest gripe with git
performance at the moment.

|nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
|
|real    0m21.742s
|user    0m21.379s
|sys     0m0.360s

That's way too long for 1030198 objects (roughly 48k objects/sec).

And it gets even worse with the gcc repository:

|nico@xanadu:gcc> time git rev-list --objects --all > /dev/null
|
|real    1m51.591s
|user    1m50.757s
|sys     0m0.810s

That's for 1267993 objects, or about 11400 objects/sec. Clearly
something is not scaling here (a first profiling step is sketched in
the P.P.S. below).


Nicolas
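
P.S. To make the duplication point concrete: every pack is
self-contained, so each pack must store its delta bases in full, and
ten independently built packs will each carry their own full copy of
popular base objects that a single repack would deltify against one
shared base. A rough way to see this on a repository that already has
several packs, from memory, so adjust as needed:

  # each pack's "non delta" summary line from verify-pack is the
  # number of objects that pack stores in full; across many packs
  # the same bases tend to show up again and again
  for idx in .git/objects/pack/pack-*.idx; do
          echo "== $idx"
          git verify-pack -v "$idx" | grep '^non delta:'
  done

  # one full repack recomputes deltas so common bases are stored once
  git repack -a -d -f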
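P.P.S. One hint about where to start looking: in both timings above the
user time accounts for almost all of the wall clock time while sys time
is negligible, so the enumeration is CPU bound in user space rather
than stalled on I/O. A profiling sketch, assuming you build an
instrumented binary for gprof (any profiler would do just as well, and
the paths below are placeholders):

  # build git with profiling instrumentation
  make clean
  make CFLAGS="-O2 -pg" LDFLAGS="-pg" git

  # run the enumeration from inside the repository being measured;
  # on exit this drops a gmon.out file in the current directory
  cd /path/to/linux-2.6
  /path/to/git/git rev-list --objects --all > /dev/null

  # list the hottest functions first
  gprof /path/to/git/git gmon.out | head -30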