Re: How to resume broke clone ?

Jeff King <peff@xxxxxxxx> · Thu, 5 Dec 2013 11:04:18 -0500

On Wed, Dec 04, 2013 at 10:50:27PM -0800, Shawn Pearce wrote:

> I wasn't thinking about using a "well known blob" for this.
> 
> Jonathan, Dave, Colby and I were kicking this idea around on Monday
> during lunch. If the initial ref advertisement included a "mirrors"
> capability the client could respond with "want mirrors" instead of the
> usual want/have negotiation. The server could then return the mirror
> URLs as pkt-lines, one per pkt. It's one extra RTT, but this is trivial
> compared to the cost to really clone the repository.

I don't think this is any more or less efficient than the blob scheme.
In both cases, the client sends a single "want" line and no "have"
lines, and then the server responds with the output (either pkt-lines,
or a single-blob pack).
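
For reference, here is a minimal Python sketch of what "mirror URLs as
pkt-lines, one per pkt" could look like on the wire. The framing (four
hex digits of length that count themselves, then the payload, with
"0000" as the flush packet) is git's existing pkt-line format. The
"mirrors" response and the example.org URLs are hypothetical, since no
such capability exists yet.

```python
FLUSH_PKT = b"0000"
MAX_PKT_LEN = 65520  # largest total pkt-line length git allows


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload as a pkt-line: 4 hex digits of length, then data."""
    length = len(payload) + 4  # the prefix counts its own 4 bytes
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def encode_mirrors(urls):
    """Hypothetical 'mirrors' response: one URL per pkt-line, then a flush."""
    return b"".join(pkt_line(u.encode() + b"\n") for u in urls) + FLUSH_PKT


def decode_mirrors(data: bytes):
    """Read pkt-lines until the flush packet and return the URLs."""
    urls, pos = [], 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # flush packet ends the list
            break
        urls.append(data[pos + 4:pos + length].decode().rstrip("\n"))
        pos += length
    return urls
```

The blob scheme would carry the same information inside a one-object
pack instead, so the round-trip count is the same either way.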

What I like about the blob approach is:

  1. It requires zero extra code on the server. This makes
     implementation simple, but also means you can deploy it
     on existing servers (or even on non-pkt-line servers like
     dumb http).

  2. It's very debuggable from the client side. You can fetch the blob,
     look at it, and decide which mirror you want outside of git if you
     want to (true, you can teach the git client to dump the pkt-line
     URLs, too, but that's extra code). You could even do this with an
     existing git client that has not yet learned about the mirror
     redirect.

  3. It removes any size or structure limits that the protocol imposes
     (I was planning to use git-config format for the blob itself). The
     URLs themselves aren't big, but we may want to annotate them with
     metadata.

     You mentioned "this is a bundle" versus "this is a regular http
     server" below. You might also want to provide network location
     information (e.g., "this is a good mirror if you are in Asia"),
     though for the most part I'd expect that to happen magically via
     CDN.

     When we discussed this before, the concept came up of offering not
     just a clone bundle, but "slices" of history (as thin-pack
     bundles), so that a fetch could grab a sequence of resumable
     slices, starting with what they have, and then topping off with a
     true fetch. You would want to provide the start and end points of
     each slice.

  4. You can manage it remotely via the git protocol (more discussion
     below).

  5. A clone done with "--mirror" will actually propagate the mirror
     file automatically.
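
To illustrate points 2 and 3, here is a minimal Python sketch of reading
a git-config-style mirror blob outside of git. The "mirror" section, the
"ko" name, and the "url" and "bundle" keys follow the git-config
commands later in this message. The example.org URL is made up, and the
parser handles only a small subset of git-config syntax (no
continuation lines, no quoting or escapes), so treat it as an
illustration rather than a format definition.

```python
import re

# Matches '[mirror "name"]'; the subsection name is optional in the regex
# so that unrelated sections can be recognized and skipped.
SECTION_RE = re.compile(r'^\[(\w+)(?:\s+"([^"]*)")?\]$')

# Hypothetical blob contents; the URL is a placeholder.
EXAMPLE_BLOB = (
    '[mirror "ko"]\n'
    '\turl = git://example.org/linux.git\n'
    '\tbundle = true\n'
)


def parse_mirror_blob(text):
    """Return {mirror_name: {key: value}} for each [mirror "name"] section."""
    mirrors = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # blank line or comment
            continue
        match = SECTION_RE.match(line)
        if match:
            section, name = match.group(1), match.group(2)
            # Only collect keys from named "mirror" sections.
            if section == "mirror" and name:
                current = mirrors.setdefault(name, {})
            else:
                current = None
            continue
        if current is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key.lower()] = value
    return mirrors
```

Extra annotations (location hints, slice start and end points) would
just be more keys in the same section, which is the appeal of not
being limited by a pkt-line layout.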

What are the advantages of the pkt-line approach? The biggest one I can
think of is that it does not pollute the refs namespace. While (5) is
convenient in some cases, it would make it more of a pain if you are
trying to keep a clone mirror up to date, but do _not_ want to pass
along upstream's mirror file.

You may want to have a server implementation that offers a dynamic
mirror list, rather than a true object stored in the ODB. That is
possible with a mirror blob, but is slightly harder (you have to fake
the object rather than just dumping a line).

> These pkt-lines need to be a bit more than just URL. Or we need a new
> URL like "bundle:http://...." to denote a resumable bundle over HTTP
> vs. a normal HTTP URL that might not be a bundle file, and is just a
> better connected server.

Right, I think that's the most critical one (though you could also just
use the convention of ".bundle" in the URL). I think we may want to
leave room for more metadata, though.
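
A minimal Python sketch of the two conventions mentioned here, the
"bundle:" prefix and the ".bundle" suffix. Neither is implemented in
git; the function name and the example.org URLs are hypothetical.

```python
from urllib.parse import urlsplit


def classify_mirror(url):
    """Return (is_bundle, fetch_url) for a mirror URL.

    A "bundle:" prefix is stripped before fetching; otherwise a path
    ending in ".bundle" is taken as a hint that the URL is a bundle file.
    """
    prefix = "bundle:"
    if url.startswith(prefix):
        return True, url[len(prefix):]
    return urlsplit(url).path.endswith(".bundle"), url
```

Either convention only answers "bundle or not", which is why a
structured format with room for more keys may be preferable.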

> The mirror URLs could be stored in $GIT_DIR/config as a simple
> multi-value variable. Unfortunately that isn't easily remotely
> editable. But I am not sure I care?

For big sites that manage the bundles on behalf of the user, I don't
think the lack of remote editing is an issue. For somebody running their
own small site, I think pushing a blob is a useful way of moving the
data to the server.

> For the average home user sharing their working repository over git://
> from their home ADSL or cable connection, editing .git/config is
> easier than a blob in refs/mirrors. They already know how to edit
> .git/config to manage remotes.

Yes, but it's editing .git/config on the server, not on the client,
which may be slightly harder for some people. I do think we'd want
some tool support on the client side. git-config recently learned to
read from a blob. The next step is:

  git config --blob=refs/mirrors --edit

or

  git config --blob=refs/mirrors mirror.ko.url git://git.kernel.org/...
  git config --blob=refs/mirrors mirror.ko.bundle true

We can't add tool support for editing .git/config on the server side,
because the method for doing so isn't standard.

> Heck, remote.origin.url might already
> be a good mirror address to advertise, especially if the client isn't
> on the same /24 as the server and the remote.origin.url is something
> like "git.kernel.org". :-)

You could have a "git-advertise-upstream" that generates a mirror blob
from your remotes config and pushes it to your publishing point. That
may be overkill, but I don't think it's possible with a
.git/config-based solution.
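
A minimal Python sketch of the generating half of such a tool: turning
a mapping of remote names to URLs into git-config-style blob text. No
"git-advertise-upstream" exists; the function name, the example remote,
and its URL are hypothetical, and the sketch assumes remote names need
no quoting or escaping.

```python
def render_mirror_blob(remotes):
    """Render {remote_name: url} as [mirror "name"] sections.

    Names are emitted in sorted order so the same input always produces
    the same blob (and therefore the same object id).
    """
    lines = []
    for name, url in sorted(remotes.items()):
        lines.append('[mirror "%s"]' % name)
        lines.append("\turl = %s" % url)
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting text would then have to be written as a blob, pointed to
by a ref such as refs/mirrors, and pushed to the publishing point.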

> > That's clever. It does not work out of the box if you are using
> > alternates, but I think it could be adapted in certain situations. E.g.,
> > if you layer the pack so that one "base" repo always has its full pack
> > at the start, which is something we're already doing at GitHub.
> 
> Yes, well, I was assuming the pack was a fully connected repack.
> Alternates always creates a partial pack. But if you have an
> alternate, that alternate maybe should be given as a mirror URL? And
> allow the client to recurse the alternate mirror URL list too?

The problem for us is not that we have a partial pack, but that the
alternates pack has a lot of other junk in it. A linux.git clone is
650MB or so. The packfile for all of the linux.git forks together on
GitHub is several gigabytes.

> What really got us worried was the bundle header has no checksums, and
> a resume in the bundle header from the wrong version could be
> interesting.

The bundle header is small enough that you should just throw it away if
you didn't get the whole thing. IIRC, that is what my patches do: they
do not do _anything_ until we receive the whole ref advertisement, at
which point we decide if it is smart, dumb, or a bundle.
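
A minimal Python sketch of the "throw it away" rule for a resumed
download. It relies on the v2 bundle format, where the header starts
with the "# v2 git bundle" signature line and ends at the first blank
line. The function name is hypothetical and this is not the logic from
the patches mentioned above.

```python
BUNDLE_V2_SIGNATURE = b"# v2 git bundle\n"


def resume_offset(partial: bytes) -> int:
    """Return the byte offset at which to resume a bundle download.

    If the signature is missing or the blank line that ends the header
    has not arrived yet, discard everything and start again from 0.
    Otherwise keep what we have and resume after the last byte received.
    """
    if not partial.startswith(BUNDLE_V2_SIGNATURE):
        return 0
    if partial.find(b"\n\n") == -1:  # header not complete yet
        return 0
    return len(partial)
```

This does not address the deeper worry that the file on the server
changed between attempts; it only avoids resuming inside a header that
carries no checksum.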

> Yes. And this is why the packfile name algorithm is horribly flawed. I
> keep saying we should change it to name the pack using the last 20
> bytes of the file but ... nobody has written the patch for that?  :-)

Totally agree. I think we could also get rid of the horrible hacks in
repack where we pack to a tempfile, then have to do another tempfile
dance (which is not atomic!) to move the same-named packfile out of the
way. If the name were based on the content, we could just throw away our
new pack if one of the same name is already there (just like we do for
loose objects).
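
A minimal Python sketch of that idea. A packfile already ends with a
20-byte SHA-1 checksum of everything before it, so naming the file
after those trailing bytes makes the name a function of the content.
The function names and the fake pack are hypothetical, and this is not
how repack currently behaves.

```python
import hashlib
import os
import tempfile


def make_fake_pack(body: bytes) -> bytes:
    """Build stand-in pack bytes: a body followed by its SHA-1 trailer."""
    return body + hashlib.sha1(body).digest()


def content_based_name(pack_bytes: bytes) -> str:
    """Name the pack after the last 20 bytes of the file."""
    return "pack-%s.pack" % pack_bytes[-20:].hex()


def install_pack(pack_dir: str, pack_bytes: bytes):
    """Write the pack under its content-based name.

    Returns (path, written). If a pack with the same name already
    exists it has the same content, so the new copy is simply dropped,
    as is done for loose objects.
    """
    final = os.path.join(pack_dir, content_based_name(pack_bytes))
    if os.path.exists(final):
        return final, False
    # Write to a temporary file first, then move it into place in one step.
    fd, tmp = tempfile.mkstemp(dir=pack_dir, prefix="tmp_pack_")
    with os.fdopen(fd, "wb") as handle:
        handle.write(pack_bytes)
    os.replace(tmp, final)
    return final, True
```

Because two packs with the same name are identical, there is never a
same-named file that has to be moved out of the way first.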

I haven't looked at making such a patch, but I think it shouldn't be too
complicated. My big worry would be weird fallouts from some hidden part
of the code that we don't realize is depending on the current naming
scheme. :)

-Peff