Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jeff King <peff@xxxxxxxx> · Wed, 10 Feb 2016 16:49:46 -0500

On Wed, Feb 10, 2016 at 12:11:46PM -0800, Shawn Pearce wrote:

> On Wed, Feb 10, 2016 at 10:59 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> >
> > ... Thoughts?
> 
> Several of us at $DAY_JOB talked about this more today and thought a
> variation makes more sense:
> 
> 1. Clients attempting clone ask for /info/refs?service=git-upload-pack
> like they do today.
> 
> 2. Servers that support resumable clone include a "resumable"
> capability in the advertisement.

Because the magic happens in the git protocol, that would mean this does
not have to be limited to git-over-http. It could be "resumable=<url>"
to point the client anywhere (the same server over a different protocol,
another server, etc).
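
To make that concrete, here is a rough sketch of how a client might pick
such a capability out of the advertisement. It assumes the capability is
spelled "resumable=<url>" and sits in the usual space-separated
capability list; the function name is made up for illustration and
nothing here is an existing git interface.

```python
from typing import Optional


def find_resumable_url(capabilities: str) -> Optional[str]:
    """Return the URL from a hypothetical 'resumable=<url>' capability.

    `capabilities` is the space-separated capability string taken from
    the ref advertisement. Returns None if the server did not offer it.
    """
    for cap in capabilities.split():
        name, sep, value = cap.partition("=")
        if name == "resumable" and sep and value:
            return value
    return None


caps = ("multi_ack side-band-64k ofs-delta "
        "resumable=https://example.com/repo/clones/1234abcd.bundle")
print(find_resumable_url(caps))
```

A client that does not know the capability simply ignores it and clones
as it does today, which is what keeps the scheme backwards compatible.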

> 3. Updated clients on clone request GET /info/refs?service=git-resumable-clone.
> 
> 4. The server may return a 302 Redirect to its current "mostly whole"
> pack file. This can be more flexible than "refs/heads/*", it just
> needs to be a mostly complete pack file that contains a complete graph
> from any arbitrary roots.

And with "resumable=<url>", the client does not have to hit the server
to do a redirect; it can go straight to the final URL, saving a
round-trip.

> 5. Clients fetch the file using standard HTTP GET, possibly with
> byte-ranges to resume.
> 
> 6. Once stored and indexed with .idx, clients run `git fsck
> --lost-found` to discover the roots of the pack it downloaded. These
> are saved as temporary references.

Clients do not have to _just_ fetch a packfile. They could get a bundle
file that contains the roots along with the packfile. I know that one of
your goals is not duplicating the storage of the packfile on the server,
but it would not be hard for the server to store the packfile and the
bundle header separately, and concatenate them on the fly.
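
A minimal sketch of that on-the-fly concatenation, assuming the server
keeps the tips in some small side store and the pack on disk as usual.
The header layout is the v2 bundle format (signature line, one
"<sha> <ref>" line per tip, then a blank line before the pack data);
make_header and stream_bundle are hypothetical names.

```python
import io
from typing import BinaryIO, Dict, Iterator


def make_header(tips: Dict[str, str]) -> bytes:
    """Build a v2 bundle header from a {refname: sha} mapping."""
    lines = ["# v2 git bundle"]
    lines += [f"{sha} {ref}" for ref, sha in tips.items()]
    # The header ends with an empty line; pack data follows it directly.
    return ("\n".join(lines) + "\n\n").encode("ascii")


def stream_bundle(header: bytes, pack: BinaryIO,
                  chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield the header, then the pack contents, without copying the pack."""
    yield header
    while True:
        chunk = pack.read(chunk_size)
        if not chunk:
            break
        yield chunk


header = make_header({"refs/heads/master": "1" * 40})
fake_pack = io.BytesIO(b"PACK...")  # stand-in for an open pack file
body = b"".join(stream_bundle(header, fake_pack))
print(len(body) == len(header) + len(b"PACK..."))
```

Serving byte ranges of the combined stream would additionally need the
header length subtracted from the requested offset, which this sketch
does not attempt.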

Right now the clients can't clone from bundles directly via HTTP. I
wrote patches for that ages ago, but got stuck on this very issue
(basically that I had to spool the bundle and then clone from it, which
temporarily doubled the client's disk space requirements). One
alternative would be to amend the bundle format so that rather than a
single file, you get a bundle header whose end says "...and my matching
packfile is 1234-abcd". And then the client knows that they can fetch
that separately from the same source.
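
For illustration, parsing such a split header could look roughly like
the following. The exact format is not specified anywhere; this sketch
assumes a normal v2 bundle header (with no prerequisite lines, since it
describes a full clone) whose last non-empty line is the pack filename.
parse_split_header is a hypothetical name.

```python
from typing import Dict, Tuple


def parse_split_header(text: str) -> Tuple[Dict[str, str], str]:
    """Split a hypothetical 'split bundle' header into (tips, pack_name)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "# v2 git bundle":
        raise ValueError("not a bundle header")
    if len(lines) < 2:
        raise ValueError("header does not name a pack")
    # Everything between the signature and the final line is a tip.
    *ref_lines, pack_name = lines[1:]
    tips = {}
    for ln in ref_lines:
        sha, ref = ln.split(" ", 1)
        tips[ref] = sha
    return tips, pack_name


example = (
    "# v2 git bundle\n"
    + "1" * 40 + " refs/heads/master\n"
    + "pack-1234abcd.pack\n"
)
tips, pack_name = parse_split_header(example)
print(pack_name, sorted(tips))
```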

It's an extra HTTP request, but it makes the code for client _and_
server way simpler. So the whole thing is basically then:

  0. During gc, server generates pack-1234abcd.pack. It writes matching
     tips into pack-1234abcd.info, which is essentially a bundle file
     whose final line says "pack-1234abcd.pack".

  1. Client contacts server via any git protocol. Server says
     "resumable=<url>". Let's say that <url> is
     https://example.com/repo/clones/1234abcd.bundle.

  2. Client goes to <url>. They see that they are fetching a bundle,
     and know not to do the usual smart-http or dumb-http protocols.
     They can fetch the bundle header resumably (though it's tiny, so it
     doesn't really matter).

  3. After finishing the bundle header, they see they need to grab the
     packfile. Based on the bundle header's URL and the filename
     contained within it, they know to get
     https://example.com/repo/clones/pack-1234abcd.pack. This is
     resumable, too.

  4. Client clones from bundled pack as normal; no root-finding magic
     required.

  5. Client runs incremental fetch against original repo from step 1.
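
The URL derivation in step 3 can be sketched as below, resolving the
filename from the header relative to the bundle's own URL. The filename
check is an assumption on my part (a client probably should not let the
header point it at an arbitrary location), and pack_url is a
hypothetical helper.

```python
import re
from urllib.parse import urljoin

# Assumed shape of the filename carried in the header, e.g. pack-1234abcd.pack.
PACK_NAME = re.compile(r"pack-[0-9a-f]+\.pack")


def pack_url(bundle_url: str, pack_name: str) -> str:
    """Resolve the pack filename against the URL the header came from."""
    if not PACK_NAME.fullmatch(pack_name):
        raise ValueError(f"unexpected pack name: {pack_name!r}")
    return urljoin(bundle_url, pack_name)


print(pack_url("https://example.com/repo/clones/1234abcd.bundle",
               "pack-1234abcd.pack"))
# https://example.com/repo/clones/pack-1234abcd.pack
```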

And you'll notice, too, that all of the bundle-http magic kicks in
during step 2 because the client sees they're grabbing a bundle. Which
means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
"go fetch from kernel.org, then come back to me".

> An advantage to this process is it's much more flexible for the server.
> There is no additional pack-*.info file required. GC can organize
> packs any way it wants, etc.

Yes, it's much better than your original email, at least for GitHub
servers. We're not very flexible with GC tricks, because we need bitmaps
to work, and because we get a lot of benefit from sharing the object
storage for forks of a single repository.

> To make step 4 really resume well, clients may need to save the first
> Location header it gets back from
> /info/refs?service=git-resumable-clone and use that on resume. Servers
> are likely to embed the pack SHA-1 in the Location header, and the
> client wants to use this on subsequent GET attempts to abort early if
> the server has deleted the pack the client is trying to obtain.

You could possibly do away with this trick if the server hands out a
unique URL in its "resumable" header. Though I imagine it might be
convenient for server admins to always point to a generic URL, and
put the logic in the HTTP layer.

OTOH, if you do the "split bundle" thing I mentioned above, then this
happens for free. The client caches the bundle header it grabs in my
step 2, and then that contains the unique pack name to fetch in step 3.
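
A sketch of what the client-side resume bookkeeping might look like
under that scheme: the partial download is stored under the pack name
taken from the cached header, so a restart either continues the same
pack with a Range request or starts fresh when the name has changed. The
".part" suffix and both helper names are invented for the example.

```python
import os
import tempfile
from typing import Dict


def partial_path(cache_dir: str, pack_name: str) -> str:
    """Where the in-progress download for a given pack would live."""
    return os.path.join(cache_dir, pack_name + ".part")


def resume_headers(path: str) -> Dict[str, str]:
    """Return the HTTP headers needed to continue a partial download."""
    try:
        have = os.path.getsize(path)
    except OSError:
        have = 0  # nothing downloaded yet
    # Ask only for the bytes we are missing; no header means a full GET.
    return {"Range": f"bytes={have}-"} if have > 0 else {}


with tempfile.TemporaryDirectory() as cache:
    path = partial_path(cache, "pack-1234abcd.pack")
    print(resume_headers(path))          # {}
    with open(path, "wb") as f:
        f.write(b"x" * 1000)             # pretend 1000 bytes arrived
    print(resume_headers(path))          # {'Range': 'bytes=1000-'}
```

The client would still need to check the response: a 206 means the
server honored the range and the data can be appended, a 200 means it is
sending the whole file again, and a 404 suggests the pack is gone and
the client should go back to step 1.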

-Peff