Re: How to resume a broken clone?

On Thu, Nov 28, 2013 at 1:29 AM, Jeff King <peff@xxxxxxxx> wrote:
> On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:
>
>> > Git should better support resuming transfers.
>> > Right now it does not do that job well.
>> > Sharing code, managing code, transferring code: isn't that what
>> > we imagine a VCS to be?
>>
>> You're welcome to step up and do it. Off the top of my head there are a few options:
>>
>>  - better integration with git bundles, provide a way to seamlessly
>> create/fetch/resume the bundles with "git clone" and "git fetch"

We have been thinking about formalizing the /clone.bundle hack used by
repo on Android. If the server has the bundle, it can add a capability
in the refs advertisement saying it's available, and the clone client
can first fetch $URL/clone.bundle.
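
For illustration, the client side could be as simple as an HTTP fetch
that resumes with a Range header. A minimal sketch in Python (not
git's actual code; the URL and file names here are hypothetical):

    import os
    import urllib.request

    def fetch_bundle(url, path):
        # Resume from however many bytes we already spooled to disk.
        have = os.path.getsize(path) if os.path.exists(path) else 0
        req = urllib.request.Request(url)
        if have:
            req.add_header("Range", "bytes=%d-" % have)
        with urllib.request.urlopen(req) as resp, \
                open(path, "ab") as out:
            if have and resp.status != 206:
                # Server ignored the Range; start the spool over.
                out.truncate(0)
            while True:
                chunk = resp.read(65536)
                if not chunk:
                    break
                out.write(chunk)

    fetch_bundle("https://example.com/repo.git/clone.bundle",
                 "clone.bundle")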

For most Git repositories the bundle can be constructed by saving the
bundle's reference header into a file, e.g.
$GIT_DIR/objects/pack/pack-$NAME.bh, at the same time the pack is
created. The bundle can then be served by concatenating the .bh and
.pack streams onto the network. This adds very little disk overhead
on the origin server, but allows resumable clone, provided the server
has not done a GC.
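
Serving it is then just streaming the two files back to back. A rough
sketch of that, following the pack-$NAME.bh convention above (the
surrounding server plumbing is omitted):

    import shutil

    def serve_clone_bundle(pack_name, out):
        # pack-$NAME.bh holds the bundle header ("# v2 git bundle",
        # the ref lines, and the terminating blank line), written
        # when pack-$NAME.pack was created.
        base = "objects/pack/pack-%s" % pack_name
        for suffix in (".bh", ".pack"):
            with open(base + suffix, "rb") as f:
                shutil.copyfileobj(f, out)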

> I posted patches for this last year. One of the things that I got hung
> up on was that I spooled the bundle to disk and then cloned from it,
> which meant that you needed twice the disk space for a moment.

I don't think this is a huge concern. In many cases the checked-out
copy of the repository is itself a sizable fraction of the .pack's
size. If you don't have 2x the .pack size available at clone time, you
may be in trouble anyway when you try to work with the repository
post-clone.

> I wanted
> to teach index-pack to "--fix-thin" a pack that was already on disk, so
> that we could spool to disk, and then finalize it without making another
> copy.

Don't you need to separate the bundle header from the pack data before
you do this? If the bundle is only used at clone time, there is no
--fix-thin step.
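
Splitting a spooled bundle apart is cheap, since the v2 bundle header
is text terminated by a blank line and everything after it is the raw
pack. A minimal sketch (not git's own parser):

    import shutil

    def split_bundle(bundle_path, pack_path):
        with open(bundle_path, "rb") as f:
            if f.readline() != b"# v2 git bundle\n":
                raise ValueError("not a v2 bundle")
            refs = []
            while True:
                line = f.readline()
                if line in (b"\n", b""):
                    break  # a blank line ends the header
                # Prerequisites start with '-'; other lines are
                # "<sha1> <refname>".
                refs.append(line.rstrip(b"\n"))
            # The rest of the file is the pack stream.
            with open(pack_path, "wb") as out:
                shutil.copyfileobj(f, out)
        return refs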

> One of the downsides of this approach is that it requires the repo
> provider (or somebody else) to provide the bundle. I think that is
> something that a big site like GitHub would do (and probably push the
> bundles out to a CDN, too, to make getting them faster). But it's not a
> universal solution.

See above; I think you can reasonably support /clone.bundle
automatically on any HTTP server. Big sites might choose to have
/clone.bundle redirect into a caching CDN that fills itself by going
to the application servers to obtain the current data. This is what we
do for Android.
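
The redirect itself is trivial to put in front of the repository. A
toy sketch with Python's standard library (the CDN URL is made up):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CloneBundleRedirect(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path.endswith("/clone.bundle"):
                # Send the client to the CDN copy; the CDN fills
                # itself from the application servers on a miss.
                self.send_response(302)
                self.send_header("Location",
                                 "https://cdn.example.com" + self.path)
                self.end_headers()
            else:
                self.send_error(404)

    HTTPServer(("", 8080), CloneBundleRedirect).serve_forever()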

>>  - stabilize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the content
> has changed) is hard, but I wonder if it would be enough to handle the
> easy cases, where nothing has changed. If the server does not use
> multiple threads for delta computation, it should generate the same pack
> from the same on-disk data deterministically. We just need a way for the
> client to indicate that it has the same partial pack.
>
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number of
> pack bytes it received. If the transfer is interrupted, the client comes
> back with the hash/bytes pair. The server starts to generate the pack,
> checks whether the hash matches, and if so, says "here is the same pack,
> resuming at byte X".

An important part of this is that the want set must be identical to
the prior request's. It is entirely possible the branch tips have
advanced since the prior packing attempt started.

> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
> base (if applicable), length> for each object would be enough. We should
> know that after calling compute_write_order. If the client has a match,
> we should be able to skip ahead to the correct byte.

I don't think length is sufficient.

The repository could have recompressed an object to the same length
but with a different libz encoding. I wonder if loose object
recompression is reliable enough about its libz encoding to resume in
the middle of an object? Is it just based on the libz version?

You may need to include information about the source of the object,
e.g. the trailing 20-byte hash in the source pack file.
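
Roughly, I would expect the resume key to cover the want set plus a
per-object record that pins down the exact bytes. Something like this
sketch, where the per-object fields (including source_pack_hash) are
my assumptions, not an agreed format:

    import hashlib

    def resume_key(want_set, write_order):
        # want_set: the exact tips the client asked for; a resumed
        # request with a different want set must not match.
        h = hashlib.sha1()
        for tip in sorted(want_set):
            h.update(tip + b"\n")
        # write_order: objects in the order compute_write_order()
        # would emit them. sha1/type/base/length alone are not
        # enough if an object was recompressed, so also mix in the
        # trailing hash of the pack the bytes are copied from.
        for obj in write_order:
            h.update(b"%s %d %s %d %s\n" % (
                obj.sha1, obj.type, obj.base or b"-",
                obj.length, obj.source_pack_hash))
        return h.hexdigest()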