Re: Multi-threaded 'git clone'

Jeff King <peff@xxxxxxxx> · Mon, 16 Feb 2015 10:47:45 -0500

On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote:

> >Then the server streams the data to the client. It might do some light
> >work transforming the data as it comes off the disk, but most of it is
> >just blitted straight from disk, and the network is the bottleneck.
> 
> Depending on how close to full the WAN link is, it may be possible to
> improve this with multiple connections (again, referencing bbcp), but
> there's also the question of if it's worth trying to use the entire WAN for
> a single user. The vast majority of the time the server is doing more than
> one thing and would rather let any individual user wait a bit and service
> the other users.

Yeah, I have seen clients that make multiple TCP connections to each
request a chunk of a file in parallel. The short answer is that this is
going to be very hard with git. Each clone generates the pack on the fly
based on what's on disk and streams it out. It should _usually_ be the
same, but there's nothing to guarantee byte-for-byte equality between
invocations. So you'd have to multiplex all of the connections into the
same server process. And even then it's hard; that process knows it's
going to send you the bytes for object X, but it doesn't know at
exactly which offset until it gets there, which makes sending things out
of order tricky. And the whole output is checksummed by a single sha1
over the whole stream that comes at the end.
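
To make the checksum point concrete, here is a minimal Python sketch,
not git's actual code. It assumes only that the stream ends with a
20-byte SHA-1 of everything before it, as a packfile does. The names
append_trailer and verify_stream and the toy payload are made up for
illustration.

```python
import hashlib

SHA1_LEN = 20  # size of the trailing checksum in bytes


def append_trailer(payload: bytes) -> bytes:
    """Return payload followed by the SHA-1 of the whole payload."""
    return payload + hashlib.sha1(payload).digest()


def verify_stream(chunks) -> bool:
    """Check a trailing-SHA-1 stream delivered as an ordered series of chunks.

    The hash is fed strictly in arrival order, and the last 20 bytes are
    held back because they may turn out to be the trailer.
    """
    hasher = hashlib.sha1()
    tail = b""
    for chunk in chunks:
        data = tail + chunk
        hasher.update(data[:-SHA1_LEN])  # everything that cannot be trailer
        tail = data[-SHA1_LEN:]          # candidate trailer so far
    return len(tail) == SHA1_LEN and hasher.digest() == tail


# Hypothetical stand-in for pack data; real packs are generated on the fly.
stream = append_trailer(bytes(range(256)) * 4)
chunks = [stream[i:i + 100] for i in range(0, len(stream), 100)]

in_order_ok = verify_stream(chunks)
reordered_ok = verify_stream(list(reversed(chunks)))
```

The in-order chunks verify and the reordered ones do not. A receiver
pulling ranges in parallel would have to reassemble everything in
order before it could check anything, since there is no per-chunk
checksum in this scheme.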

I think the most feasible thing would be to quickly spool it to a server
on the LAN, and then use an existing fetch-in-parallel tool to grab it
from there over the WAN.

-Peff