Re: Performance issue: initial git clone causes massive repack

On Tue, 14 Apr 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Fri, 10 Apr 2009, Robin H. Johnson wrote:
> 
> > On Wed, Apr 08, 2009 at 12:52:54AM -0400, Nicolas Pitre wrote:
> > > > http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> > > > At least that's what I cloned ;-) I hope it's the right one, but it fits
> > > > the description...
> > > OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
> > > 725MB pack file.  So that's about half the originally reported size.
> > The one problem with having the single large packfile is that Git
> > doesn't have a trivial way to resume downloading it when the git://
> > protocol is used.
> > 
> > For our developers cursed with bad internet connections (a fair number
> > of firewalls that don't seem to respect keepalive properly), I suppose
> > I can probably just maintain a separate repo for their initial clones,
> > which leaves a large overall download, but more chances to resume.
> 
> IMO the best we could do under these circumstances is to use fsck 
> --lost-found to find those commits which have a complete history (i.e. no 
> "broken links") -- this probably needs to be implemented as a special mode 
> of --lost-found -- and store them in a temporary to-be-removed 
> namespace, say refs/heads/incomplete-refs/$number, which will be sent to 
> the server when fetching the next time.  (Might need some iterations to 
> get everything, though.)

Well, although this might seem a good idea, it would help only in 
those cases where there is at least one complete revision available, 
i.e. one needing no delta.  This is usually true for the top commit 
after a repack, whose objects are all stored at the front of the pack 
and serve as base objects for deltas from subsequent (older) commits.  
The thing is, that first revision is likely to occupy a significant 
portion of the whole pack, like no less than the size of the 
equivalent .tar.gz for the content of that commit.  To see what this 
represents, just try a shallow clone with depth=1.  For the Linux 
kernel, this is more than 80MB while the whole repo is in the 200MB 
range.  So if your connection isn't reliable enough to transfer at 
least that amount, then you're screwed anyway.
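
Something along these lines gives a feel for that minimum amount (the 
git:// URL is only a guess derived from the gitweb address quoted 
above, so adjust as needed):

    $ git clone --depth 1 \
        git://git.overlays.gentoo.org/exp/gentoo-x86.git shallow-test
    $ du -sh shallow-test/.git/objects/pack

The size of the resulting pack is roughly the smallest self-contained 
transfer a client has to be able to complete in one go.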

Independently of this, I think there is quite a lot of confusion here.  
According to Robin, the reason for splitting the large Gentoo repo into 
multiple packs is apparently to help with resuming a clone.  We know 
that the git:// protocol is currently not resumable, and having 
multiple packs on the remote server won't change the outcome in any 
way, as the client still receives a single big pack anyway.

WRT the HTTP protocol, I was questioning git's ability to resume the 
transfer of a pack in the middle, without redownloading it all, if 
that transfer is interrupted.  And Mike Hommey says this is indeed 
possible.
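
That makes sense for the dumb HTTP transport, since the pack files are 
served as plain static files and any range-capable HTTP client can 
pick up an interrupted download where it left off.  Just as an 
illustration (the repository URL and pack name below are made up):

    $ wget -c \
        http://example.org/repo.git/objects/pack/pack-0123abcd.pack

Git's own HTTP fetching code presumably does the equivalent with 
byte-range requests when it finds a partially downloaded pack.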

Meaning there is simply no reason to split a big pack into multiple 
ones.  If anything, it'll only make a clone over the native git 
protocol more costly for the server, which has to pack everything back 
together.
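
On the server side, keeping everything in one tightly deltified pack 
is therefore the way to go.  As a sketch, using the --window/--depth 
values mentioned earlier in this thread (-f recomputes existing 
deltas, which is optional but worthwhile when raising the window 
size):

    $ git repack -a -d -f --window=250 --depth=250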


Nicolas
