Re: How to resume broke clone ?

Shawn Pearce <spearce@xxxxxxxxxxx> · Thu, 5 Dec 2013 07:11:15 -0800

On Thu, Dec 5, 2013 at 5:21 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
> This discussion has mostly been about letting small Git servers delegate
> the work of an initial clone to a beefier server.  I haven't seen any
> explicit mention of the inverse:
>
> Suppose a company has a central Git server that is meant to be the
> "single source of truth", but has worldwide offices and wants to locate
> bootstrap mirrors in each office.  The end users would not even want to
> know that there are multiple servers.  Hosters like GitHub might also
> encourage their big customers to set up bootstrap mirror(s) in-house to
> make cloning faster for their users while reducing internet traffic and
> the burden on their own infrastructure.  The goal would be to make the
> system transparent to users and easily reconfigurable as circumstances
> change.

I think there is a different way to do that.

Build a caching Git proxy server. And teach Git clients to use it.

One idea we had at $DAY_JOB a couple of years ago was to build a
daemon that sat in the background and continuously fetched content
from repository upstreams. We made it efficient by modifying the Git
protocol to use a hanging network socket, and the upstream server
would broadcast push pack files down these hanging streams as pushes
were received.

The original intent was for an Android developer to be able to have
his working tree forest of 500 repositories subscribe to our internal
server's broadcast stream. We figured if the server knows exactly
which refs every client has, because they all have the same ones, and
their streams are all still open and active, then the server can make
exactly one incremental thin pack and send the same copy to every
client. It's "just" a socket write problem. Instead of packing the same
stuff 100 times for 100 clients, it's packed once and sent 100 times.
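
A rough sketch of that "pack once, write many" idea, in Python. This is
not the JGit patch code. `build_thin_pack` and `broadcast` are
hypothetical names, and the "pack" here is just concatenated bytes
standing in for a real incremental thin pack:

```python
import io
from typing import BinaryIO, Iterable, List


def build_thin_pack(new_objects: List[bytes]) -> bytes:
    """Stand-in for real pack generation (assumption: a real server would
    build an incremental thin pack here). Called once per push."""
    return b"".join(new_objects)


def broadcast(pack: bytes, streams: Iterable[BinaryIO]) -> int:
    """Write the same already-built pack to every open client stream.

    Returns the number of streams written to. No per-client packing
    happens here, only socket (stream) writes.
    """
    sent = 0
    for stream in streams:
        stream.write(pack)
        stream.flush()
        sent += 1
    return sent


if __name__ == "__main__":
    # In-memory buffers stand in for 100 hanging client sockets.
    clients = [io.BytesIO() for _ in range(100)]
    pack = build_thin_pack([b"commit", b"tree", b"blob"])  # packed once
    print(broadcast(pack, clients), "clients received", len(pack), "bytes each")
```

This only works because every subscriber is assumed to hold the same
refs; a client that has fallen behind would need its own pack.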

Then we realized remote offices could also install this software on a
local server, and use this as a fan-out distributor within the LAN. We
were originally thinking about some remote offices on small Internet
connections, where delivery of 10 MiB x 20 was a lot but delivery of
10 MiB once and local fan-out on the Ethernet was easy.

The JGit patches for this work are still pending[1].

If clients had a local Git-aware cache server in their office and
~/.gitconfig had the address of it, your problem becomes simple.

Clients clone from the public URL, e.g. GitHub, but the local cache
server first gives the client a URL to clone from the cache itself. Once
that is complete, the client can fetch from the upstream. The cache
server can be self-maintaining, watching its requests to see what is
accessed often-ish, and keep those repositories current-ish locally by
running git fetch itself in the background.
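
One way the "accessed often-ish" bookkeeping could look, as a sketch
only. `AccessTracker`, `min_hits` and `window_seconds` are made-up names
and thresholds, not anything from an existing tool:

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional


class AccessTracker:
    """Counts recent requests per repository URL (hypothetical helper)."""

    def __init__(self, min_hits: int = 3, window_seconds: float = 3600.0) -> None:
        self.min_hits = min_hits
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, repo_url: str, now: Optional[float] = None) -> None:
        """Note one client request for repo_url."""
        self._hits[repo_url].append(time.time() if now is None else now)

    def hot_repos(self, now: Optional[float] = None) -> List[str]:
        """Return URLs requested at least min_hits times within the window.

        These are the repositories a background job would keep fetching.
        """
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        hot = []
        for url, stamps in self._hits.items():
            # Drop requests that have aged out of the window.
            while stamps and stamps[0] < cutoff:
                stamps.popleft()
            if len(stamps) >= self.min_hits:
                hot.append(url)
        return sorted(hot)
```

Repositories that drop out of `hot_repos()` simply stop being
refreshed; whether to also evict them from disk is a separate policy
decision.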

It's easy to do this with bundles on "CDN"-like HTTP. Just use the
office's caching HTTP proxy server, assuming its cache is big enough
for those large Git bundle payloads and the viral cat videos. But you
are at the mercy of the upstream bundler rebuilding the bundles, and
of refetching them in whole. Neither is great.

A simple self-contained server that doesn't accept pushes, but knows
how to clone repositories, fetch them periodically, and run `git gc`,
works well. And the mirror URL extension we have been discussing in
this thread would work fine here. The cache server can return URLs
that point to itself. Or flat out proxy the Git transaction with the
origin server.
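
For illustration, the clone / periodic fetch / `git gc` cycle of such a
server could be driven by something like the sketch below. The function
names and the directory layout are assumptions; the git commands
themselves (`clone --mirror`, `fetch --prune`, `gc`) are standard:

```python
import os
import subprocess
from typing import List


def maintenance_commands(upstream_url: str, mirror_dir: str,
                         exists: bool) -> List[List[str]]:
    """Return the git commands for one maintenance pass over one mirror.

    First pass: create a bare mirror. Later passes: fetch, then gc.
    """
    if not exists:
        return [["git", "clone", "--mirror", upstream_url, mirror_dir]]
    git = ["git", "--git-dir=" + mirror_dir]
    return [
        git + ["fetch", "--prune", "origin"],
        git + ["gc"],
    ]


def maintain(upstream_url: str, mirror_dir: str) -> None:
    """Run one pass; raises CalledProcessError if a git command fails."""
    exists = os.path.isdir(mirror_dir)
    for cmd in maintenance_commands(upstream_url, mirror_dir, exists):
        subprocess.run(cmd, check=True)
```

Calling `maintain()` from a timer for each cached repository gives the
read-only mirror; serving it to clients and refusing pushes is left to
whatever transport the server exposes.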

[1] https://git.eclipse.org/r/#/q/owner:wetherbeei%2540google.com+status:open,n,z