Re: Multi-threaded 'git clone'

Martin Fick <mfick@xxxxxxxxxxxxxx> · Mon, 16 Feb 2015 22:20:02 -0700

There currently is a thread on the Gerrit list about how much faster cloning can be when using Gerrit/jgit GCed packs with bitmaps versus C git GCed packs with bitmaps.

Some of the differences outlined: jgit seems to have more bitmaps, creating one for every ref under refs/heads.  Is C git doing that?  Another difference seems to be that jgit creates two packs, splitting stuff not reachable from refs/heads into its own pack.  This makes a clone have zero CPU server side in the pristine case.  In the Gerrit use case, this second "unreachable" packfile can be sizeable.  I wonder if there are other use cases where this might also be the case (and is thus slowing down clones for C git GCed repos)?
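To make the two-pack split concrete, here is a toy sketch in Python.  It is not jgit's actual code.  It assumes a hypothetical in-memory object graph (a dict mapping each object id to the ids it references, with every referenced id present as a key) and a list of the ids that refs/heads point at.  It walks from the heads and puts everything it reaches in one set and everything else in another:

```python
from collections import deque

def split_by_heads(graph, head_tips):
    """Partition object ids into (reachable from heads, everything else).

    graph:     dict mapping object id -> list of referenced object ids.
               Hypothetical stand-in for a real object database; assumed
               complete, i.e. every referenced id is also a key.
    head_tips: object ids that refs/heads/* point at.
    """
    reachable = set()
    queue = deque(head_tips)
    while queue:
        oid = queue.popleft()
        if oid in reachable:
            continue
        reachable.add(oid)
        # Follow every reference we have not already visited.
        queue.extend(ref for ref in graph[oid] if ref not in reachable)
    unreachable = set(graph) - reachable
    return reachable, unreachable
```

In this model the first set would go into the pack that a clone can stream as-is, and the second set into the separate "unreachable" pack.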

If there is not a lot of parallelism left to squeak out, perhaps a focus with better returns is trying to do whatever is possible to make all clones (and potentially any fetch use case deemed important on a particular server) have zero CPU?  Depending on what a server's primary mission is, I could envision certain admins willing to sacrifice significant amounts of disk space to speed up their fetches.  Perhaps some more extreme thinking (such as what must have led to bitmaps) is worth brainstorming about to improve server use cases?

What if an admin were willing to sacrifice a packfile for every use case he deemed important?  Could git be made to support that easily?  For example, maybe the admin considers a clone or a fetch from master to be important.  Could zero percent CPU be achieved regularly for those two use cases?  Zero CPU clones would be possible if the repository were repacked in the jgit style after any push to a head.  Is it worth exploring ways of making GC efficient enough to make this feasible?  Can bitmaps be leveraged to make repacking faster?  I believe that at least reachability checking could potentially be improved with bitmaps?  Are there potentially any ways to reuse deltas better during repacking (not bitmap related), by somehow reversing or translating deltas to new objects that were just received, without actually recalculating them, but yet still getting most objects deltified against the newest objects (achieving the same packs as git GC would achieve today, but faster)?  What other pieces need to be improved to make repacking faster?

As for the single branch fetch case, could this somehow be improved by allocating one or more packfiles to this use case?  The simplest single branch fetch use case is likely someone doing a git init followed by a single branch fetch.  I think the android repo tool can be used in this way, so this may actually be a common use case?  With a packfile dedicated to this branch, git should be able to just stream it out without any CPU.  But I think git would need to know this packfile exists to be able to use it.  It would be nice if bitmaps could help here, but I believe bitmaps can so far only be used for one packfile.  I understand that making bitmaps span multiple packfiles would be very complicated, but maybe it would not be so hard to support bitmaps on multiple packfiles if each of these were "self contained"?  By self contained I mean that all objects referenced by objects in the packfile were contained in that packfile.
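The "self contained" property above can be stated as a small check.  This is only a sketch in Python, assuming a hypothetical summary of one packfile's contents as a dict mapping each object id to the ids it references; it does not read real packfiles:

```python
def is_self_contained(pack_objects):
    """Return True if every reference made by an object in the pack
    points at another object in the same pack.

    pack_objects: dict mapping object id -> list of referenced object ids
                  (hypothetical summary of a single packfile).
    """
    return all(
        ref in pack_objects
        for refs in pack_objects.values()
        for ref in refs
    )

def missing_references(pack_objects):
    """List (object id, missing referenced id) pairs, sorted.
    The list is empty exactly when the pack is self contained."""
    return sorted(
        (oid, ref)
        for oid, refs in pack_objects.items()
        for ref in refs
        if ref not in pack_objects
    )
```

A pack holding a full branch history would pass this check, while a thin pack, which by design refers to objects the receiver already has, would not.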

What other still unimplemented caching techniques could be used to improve clone/fetch use cases? 

- Shallow clones: dedicate a special packfile to this?  What about another bitmap format that only maps objects in a single tree to help with this?

- Small fetches (simple branch FF updates): I suspect these are fast enough, but if not, maybe caching some thin packs (that could result in zero CPU requests for many clients) would be useful?  Maybe spread these out exponentially over time so that many will be available for recent updates and fewer for older updates?  I know git normally throws away thin packs after receiving them and resolving them, but if it kept them around (maybe in a special directory), it seems that they could be useful for updating other clients with zero CPU?  A thin pack cache might be something really easy to manage based on file timestamps; an admin may simply need to set a max cache size.  But how can git know what thin packs it has, and what they would be useful for?  Perhaps by naming them with their start and ending shas?
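The thin pack cache idea can be sketched in a few lines of Python.  Everything here is hypothetical: git has no such cache, and the `thin-<start>-<end>.pack` naming scheme, the function names, and the cache directory are assumptions made only for illustration.  The sketch names each cached pack after its start and ending shas, looks a pack up by that pair, and evicts the oldest files by timestamp until the cache fits under a max size:

```python
import os

def thin_pack_name(start_sha, end_sha):
    """Hypothetical naming scheme: thin-<start>-<end>.pack"""
    return f"thin-{start_sha}-{end_sha}.pack"

def find_cached_pack(cache_dir, client_has, client_wants):
    """Return the path of a cached thin pack that takes a client from
    client_has to client_wants, or None if no such file is cached."""
    path = os.path.join(cache_dir, thin_pack_name(client_has, client_wants))
    return path if os.path.isfile(path) else None

def evict_to_size(cache_dir, max_bytes):
    """Delete the oldest cached packs (by mtime) until the total size of
    the cache is at most max_bytes.  Returns the removed file names."""
    entries = []
    for name in os.listdir(cache_dir):
        if name.startswith("thin-") and name.endswith(".pack"):
            path = os.path.join(cache_dir, name)
            st = os.stat(path)
            entries.append((st.st_mtime, st.st_size, name, path))
    entries.sort()  # oldest first
    total = sum(size for _, size, _, _ in entries)
    removed = []
    for _, size, name, path in entries:
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(name)
    return removed
```

A lookup by exact (start, end) pair only helps clients sitting exactly at a cached start sha, which is one reason the spacing of cached packs over time would matter.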

Sorry for the long-winded rant.  I suspect that some variation of each of my suggestions has already been suggested, but maybe they will rekindle some older, now useful thoughts, or inspire some new ones.  And maybe some of these are better to pursue than more parallelism?

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project

On Feb 16, 2015 8:47 AM, Jeff King <peff@xxxxxxxx> wrote:
>
> On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote: 
>
> > >Then the server streams the data to the client. It might do some light 
> > >work transforming the data as it comes off the disk, but most of it is 
> > >just blitted straight from disk, and the network is the bottleneck. 
> > 
> > Depending on how close to full the WAN link is, it may be possible to 
> > improve this with multiple connections (again, referencing bbcp), but 
> > there's also the question of if it's worth trying to use the entire WAN for 
> > a single user. The vast majority of the time the server is doing more than 
> > one thing and would rather let any individual user wait a bit and service 
> > the other users. 
>
> Yeah, I have seen clients that make multiple TCP connections to each 
> request a chunk of a file in parallel. The short answer is that this is 
> going to be very hard with git. Each clone generates the pack on the fly 
> based on what's on disk and streams it out. It should _usually_ be the 
> same, but there's nothing to guarantee byte-for-byte equality between 
> invocations. So you'd have to multiplex all of the connections into the 
> same server process. And even then it's hard; that process knows it's
> going to send you the bytes for object X, but it doesn't know at
> exactly which offset until it gets there, which makes sending things out 
> of order tricky. And the whole output is checksummed by a single sha1 
> over the whole stream that comes at the end. 
>
> I think the most feasible thing would be to quickly spool it to a server 
> on the LAN, and then use an existing fetch-in-parallel tool to grab it 
> from there over the WAN. 
>
> -Peff 