Re: Performance issue: initial git clone causes massive repack

Nicolas Pitre <nico@xxxxxxx> · Tue, 07 Apr 2009 13:48:02 -0400 (EDT)

On Tue, 7 Apr 2009, Björn Steinbrink wrote:

> On 2009.04.07 09:13:45 -0400, Nicolas Pitre wrote:
> > Having git-rev-list consume about 2G RSS for the enumeration of 4M
> > objects is simply unacceptable, period.  This is the equivalent of 500
> > bytes per object pinned in memory on average, just for listing objects,
> > which is completely silly. We ought to do better than that.
> 
> Ah, crap, I might have been fooled by "ps aux", top actually shows about
> 1.3G being shared, likely the mmapped pack files. And that will be
> reused, assuming the box has enough memory to keep all that stuff.

Right.  And since the pack is mapped read-only, it can be paged out 
easily by the OS.  And if that doesn't help, we already have 
core.packedGitWindowSize and core.packedGitLimit config options to play 
with.
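
As a minimal sketch of tuning those two options from a script: the
helper names and the "32m"/"256m" values below are illustrative
assumptions, not recommendations; only the two config keys come from
the message above.

```python
import subprocess


def packed_git_config_cmds(window_size, limit):
    """Build the `git config` invocations for the two pack-mapping knobs.

    Values are passed through as strings; git accepts k/m/g suffixes.
    """
    return [
        ["git", "config", "core.packedGitWindowSize", window_size],
        ["git", "config", "core.packedGitLimit", limit],
    ]


def apply_packed_git_config(repo_dir, window_size="32m", limit="256m"):
    """Apply the settings to the repository at repo_dir (hypothetical helper)."""
    for cmd in packed_git_config_cmds(window_size, limit):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Smaller windows and a lower limit reduce how much pack data is mapped
at once, at the cost of more remapping.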

> But that's still 700MB or about 150 bytes per object on average.
> 
> A "struct tree" is 40 bytes here, adding the average path length (19 in
> this repo) that's 59 bytes, leaving about 90 bytes of "overhead" per
> object, as in the end we seem to care only about the sha1 and the path
> name.
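
The per-object figures in this exchange can be rechecked with a few
lines of arithmetic. This is only a sketch using the rounded numbers
quoted in the thread (2G RSS, 1.3G shared, 4M objects, treated as
decimal units); the variable names are made up for illustration.

```python
# Rounded figures as quoted in the thread (assumed decimal units).
objects = 4_000_000
total_rss = 2_000_000_000
shared = 1_300_000_000

# Bytes per object if all of RSS is counted.
per_object_total = total_rss / objects

# Bytes per object once the shared (mmapped pack) part is excluded.
private = total_rss - shared
per_object_private = private / objects

# Known per-object payload: struct tree plus average path length.
struct_tree = 40
avg_path_len = 19
payload = struct_tree + avg_path_len

print(per_object_total)    # 500.0
print(per_object_private)  # 175.0 with exactly 4M objects
print(payload)             # 59
```

With exactly 4M objects the private share works out to 175 bytes per
object, so the "about 150" above presumably reflects an object count
somewhat over 4M; either way, roughly 90 bytes or more per object is
unaccounted for by the 59 bytes of payload.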

I'm starting to think more seriously about pack v4 again, where each
path component is indexed in a table.  Because most tree objects are
different revisions of the same path, this could represent a significant
saving in memory as well.
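
To illustrate the idea of indexing path components in a table, here is
a rough sketch. It is not the pack v4 format; the PathTable class and
the example paths are hypothetical and only show why shared components
are stored once.

```python
class PathTable:
    """Store each distinct path component once and refer to it by index."""

    def __init__(self):
        self._index = {}   # component string -> integer id
        self._names = []   # integer id -> component string

    def intern(self, component):
        """Return the id for a component, adding it on first sight."""
        idx = self._index.get(component)
        if idx is None:
            idx = len(self._names)
            self._index[component] = idx
            self._names.append(component)
        return idx

    def encode(self, path):
        """Turn 'a/b/c' into a tuple of component ids."""
        return tuple(self.intern(c) for c in path.split("/"))

    def decode(self, ids):
        """Rebuild the path string from component ids."""
        return "/".join(self._names[i] for i in ids)

    def __len__(self):
        return len(self._names)


table = PathTable()
a = table.encode("drivers/net/e1000.c")
b = table.encode("drivers/net/tg3.c")
print(a, b, len(table))  # (0, 1, 2) (0, 1, 3) 4
```

Every further revision of the same path reuses the existing ids, so
the per-object cost is a few small integers rather than a copy of the
path string.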

> And in the upload-pack case, there's also pack-objects running
> concurrently, already going up to 950M RSS/100M shared _while_ the
> rev-list is still running. So that's 3G of memory usage (2G if you
> ignore the shared stuff) before the "Compressing objects" part even
> starts. And of course, pack-objects will apparently start to mmap the
> pack files only after the rev-list finished, so a "smart" OS might have
> removed a lot of the mmapped stuff from memory again, causing it to be
> re-read. :-/

The first low-hanging fruit to help this case is to make upload-pack use
the --revs argument with pack-objects to let it do the object enumeration
itself directly, instead of relying on the rev-list output through a
pipe.  This is what 'git repack' does already.  pack-objects has to
access the pack anyway, so this would eliminate an extra access from a
different process.
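
A minimal sketch of what driving pack-objects with --revs looks like
from the outside, assuming a script rather than upload-pack itself.
The function names are hypothetical; --revs, --stdout and -q are
existing pack-objects options, and revisions are fed on stdin, with a
leading "^" excluding what the other side already has.

```python
import subprocess


def rev_input(wants, haves=()):
    """Format revision arguments for `git pack-objects --revs` stdin.

    Wanted revisions are listed as-is; revisions to exclude get a
    leading '^', as with rev-list.
    """
    lines = list(wants) + ["^" + h for h in haves]
    return "".join(line + "\n" for line in lines)


def pack_with_revs(repo_dir, wants, haves=()):
    """Let pack-objects enumerate objects itself; return the pack bytes."""
    proc = subprocess.run(
        ["git", "pack-objects", "--revs", "--stdout", "-q"],
        cwd=repo_dir,
        input=rev_input(wants, haves).encode(),
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout
```

For example, pack_with_revs(".", ["HEAD"]) would produce a pack with
everything reachable from HEAD using a single process, with no
separate rev-list feeding object names through a pipe.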

Nicolas