Re: How to resume broken clone?

On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:

> > Git should have better support for resuming transfers.
> > Right now it does not seem to do that job well.
> > Sharing code, managing code, transferring code: what kind of VCS do we imagine it to be?
> 
> You're welcome to step up and do it. Off the top of my head, there are a few options:
> 
>  - better integration with git bundles, provide a way to seamlessly
> create/fetch/resume the bundles with "git clone" and "git fetch"

I posted patches for this last year. One of the things that I got hung
up on was that I spooled the bundle to disk and then cloned from it,
which meant that you needed twice the disk space for a moment. I wanted
to teach index-pack to "--fix-thin" a pack that was already on disk, so
that we could spool to disk, and then finalize it without making another
copy.
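
Spooling to a static file is also what makes the transfer resumable
in the first place: the client can just issue an HTTP range request
for the bytes it is missing. A minimal sketch of that client side
(untested; the URL, filename, and direct use of libcurl are
illustrative, not what my patches did):

/*
 * Resume a partial bundle download over HTTP. Assumes the server
 * honors Range requests. Compile with: cc resume.c -lcurl
 */
#include <stdio.h>
#include <sys/stat.h>
#include <curl/curl.h>

int main(void)
{
	const char *url = "https://example.com/repo.bundle";
	const char *out = "repo.bundle.part";
	struct stat st;
	curl_off_t offset = 0;
	FILE *fp;
	CURL *curl;
	CURLcode res;

	/* If a partial spool file exists, pick up where it ends. */
	if (stat(out, &st) == 0)
		offset = (curl_off_t)st.st_size;

	fp = fopen(out, "ab");	/* append to the partial file */
	if (!fp)
		return 1;

	curl = curl_easy_init();
	if (!curl) {
		fclose(fp);
		return 1;
	}
	curl_easy_setopt(curl, CURLOPT_URL, url);
	curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
	/* Ask the server to start at the byte we already have. */
	curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, offset);
	res = curl_easy_perform(curl);

	curl_easy_cleanup(curl);
	fclose(fp);
	return res == CURLE_OK ? 0 : 1;
}

The server side is just a static file, which is what makes range
requests (and CDNs) work at all.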

One of the downsides of this approach is that it requires the repo
provider (or somebody else) to provide the bundle. I think that is
something that a big site like GitHub would do (and probably push the
bundles out to a CDN, too, to make getting them faster). But it's not a
universal solution.

>  - stabilize pack order so we can resume downloading a pack

I think stabilizing in all cases (e.g., including ones where the content
has changed) is hard, but I wonder if it would be enough to handle the
easy cases, where nothing has changed. If the server does not use
multiple threads for delta computation, it should generate the same pack
from the same on-disk data deterministically. We just need a way for the
client to indicate that it has the same partial pack.

I'm thinking that the server would report some opaque hash representing
the current pack. The client would record that, along with the number of
pack bytes it received. If the transfer is interrupted, the client comes
back with the hash/bytes pair. The server starts to generate the pack,
checks whether the hash matches, and if so, says "here is the same pack,
resuming at byte X".
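
In code, the server-side decision might look something like this
standalone toy (the names, struct, and hashes are all made up;
nothing like this exists in the protocol today):

/*
 * Toy model of the resume negotiation sketched above. The client
 * saves the server's opaque hash plus the byte count it received;
 * on reconnect the server honors the offset only if the hash still
 * matches the pack it would generate now.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct resume_state {
	char pack_hash[41];		/* hex hash the server reported */
	uint64_t bytes_received;	/* pack bytes already on disk */
};

static uint64_t resume_offset(const char *current_hash,
			      const struct resume_state *client)
{
	if (client && !strcmp(client->pack_hash, current_hash))
		return client->bytes_received;	/* same pack; skip ahead */
	return 0;				/* pack changed; start over */
}

int main(void)
{
	struct resume_state saved = { "1234abcd", 1234567 };

	/* Server would generate the same pack: resume at byte 1234567. */
	printf("%llu\n", (unsigned long long)resume_offset("1234abcd", &saved));

	/* Repo was repacked in the meantime: start from byte 0. */
	printf("%llu\n", (unsigned long long)resume_offset("deadbeef", &saved));
	return 0;
}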

What would need to go into such a hash? It would need to represent the
exact bytes that will go into the pack, but without actually generating
those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
base (if applicable), length> for each object would be enough. All of
that is known once compute_write_order has run. If the client has a
match, we should be able to skip ahead to the correct byte.
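
Sketched as a standalone toy (OpenSSL's SHA-1 stands in for git's
internal hash API here, and object_entry is a simplification of
pack-objects' real struct):

/*
 * An identity hash for a pack that has not been written yet: hash
 * the <object sha1, type, delta base, length> tuple for each object
 * in write order. Real code would hash a canonical byte encoding
 * rather than raw struct fields (endianness), but the idea is the
 * same. Compile with: cc hash.c -lcrypto
 */
#include <stdio.h>
#include <stdint.h>
#include <openssl/sha.h>

struct object_entry {
	unsigned char sha1[20];
	uint32_t type;			/* OBJ_COMMIT, OBJ_TREE, ... */
	unsigned char base_sha1[20];	/* all zeros if not a delta */
	uint64_t size;			/* size as stored in the pack */
};

/* Call after compute_write_order() has fixed the object sequence. */
static void pack_identity_hash(const struct object_entry *objects,
			       size_t nr, unsigned char out[20])
{
	SHA_CTX ctx;
	size_t i;

	SHA1_Init(&ctx);
	for (i = 0; i < nr; i++) {
		const struct object_entry *e = &objects[i];
		SHA1_Update(&ctx, e->sha1, 20);
		SHA1_Update(&ctx, &e->type, sizeof(e->type));
		SHA1_Update(&ctx, e->base_sha1, 20);
		SHA1_Update(&ctx, &e->size, sizeof(e->size));
	}
	SHA1_Final(out, &ctx);
}

int main(void)
{
	struct object_entry objs[2] = { { {0}, 1, {0}, 1024 },
					{ {0}, 3, {0}, 2048 } };
	unsigned char hash[20];
	size_t i;

	pack_identity_hash(objs, 2, hash);
	for (i = 0; i < 20; i++)
		printf("%02x", hash[i]);
	printf("\n");
	return 0;
}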

>  - remote alternates: the repo will ask for more and more objects as
> you need them (so goodbye to the distributed model)

This is also something I've been playing with, but just for very large
objects (to support something like git-media, but below the object
graph layer). I don't think it would apply here, as the kernel has a lot
of small objects, and getting them in the tight delta'd pack format
increases efficiency a lot.
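
For what it's worth, the shape of that idea is "object reads fall
back to the network on a local miss". A toy sketch (every name here
is hypothetical; nothing like this exists in git):

/*
 * Toy remote-alternates lookup: if an object is not on disk, fetch
 * it from a configured remote on demand. The cost is that reads can
 * now require the network, which is why it breaks the fully
 * distributed model.
 */
#include <stdio.h>
#include <stdlib.h>

/* Placeholder local lookup; NULL means the object is not on disk. */
static void *read_local_object(const char *sha1_hex, size_t *size)
{
	(void)sha1_hex;
	*size = 0;
	return NULL;	/* simulate a miss */
}

/* Placeholder fetch from the remote alternate. */
static void *fetch_from_remote(const char *url, const char *sha1_hex,
			       size_t *size)
{
	printf("fetching %s from %s\n", sha1_hex, url);
	*size = 1;
	return calloc(1, 1);	/* stand-in for the object data */
}

/* Object reads transparently fall back to the remote. */
static void *read_object(const char *sha1_hex, size_t *size)
{
	void *buf = read_local_object(sha1_hex, size);
	if (!buf)
		buf = fetch_from_remote("https://example.com/repo.git",
					sha1_hex, size);
	return buf;
}

int main(void)
{
	size_t size;
	void *obj = read_object("1234abcd", &size);
	free(obj);
	return 0;
}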

-Peff