Again, I'm about to leave on a trip for a few days (back late Thursday), but
just wanted to comment on the thread.

On Mon, Apr 06, 2009 at 12:06:00AM -0400, Nicolas Pitre wrote:
> > While my current pack setup has multiple packs of not more than 100MiB
> > each, that was simply for ease of resume with rsync+http tests. Even
> > when I already had a single pack, with every object reachable,
> > pack-objects was redoing the packing.
> In that case it shouldn't have.
I'll retest that part on my return, but I'm pretty sure I did see the same
excess CPU time usage.

> > Also, I did another trace, using some other hardware, in a LAN setting,
> > and noticed that git-upload-pack/pack-objects only seems to start output
> > to the network after it reaches 100% in 'remote: Compressing objects:'.
> That's to be expected. Delta compression matches objects which are not
> in the stream order at all. Therefore it is not possible to start
> outputting pack data until this pass is done. Still, this pass should
> not be invoked if your repository is already fully packed into one pack.
So it's seeking around the existing packs before sending?

> Can you confirm this is actually the case?
The most recent tests were with the 15 (+ one partial) packs limited to a
max of 100MiB each, because that made resume for rsync/http during the
tests much cleaner.

> > Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at
> > the server in this case cut the 200 wallclock minutes before any
> > sending took place down to 5 minutes.
> Well... here's a wild guess. In the source repository serving clone
> requests, please do:
> 	git config pack.deltaCacheSize 1
> 	git config pack.deltaCacheLimit 0
> and try cloning again with a fully packed repository.
I did the multiple-pack case quickly, and found that it does still take a
long time in the low-memory case. I'll do the test with a single pack on
my return.

> The caching pack project is to address a different issue: mainly to
> bypass the object enumeration cost. In other words, it could allow for
> skipping the "Counting objects" pass, and a tiny bit more. At least in
> theory that's about the main difference. This has many drawbacks as
> well though.
Relatedly, would it be possible to keep a cache of enumerated objects that
was trivially updatable during pushes?

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@xxxxxxxxxx
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
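
P.S. For reference, roughly the single-pack re-test I intend to run on my
return. This is only a sketch; the repository path and clone URL below are
placeholders:

	# collapse everything into one fully-packed pack, recomputing deltas
	cd /path/to/repo.git             # placeholder path
	git repack -a -d -f
	git count-objects -v             # should report "packs: 1" when fully packed

	# Nicolas' suggested delta-cache settings
	git config pack.deltaCacheSize 1
	git config pack.deltaCacheLimit 0

	# then time a fresh clone from another host
	time git clone git://server/repo.git   # placeholder URL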