On Thu, 2009-08-20 at 09:37 +0200, Jakub Narebski wrote:
> You would have the same (or at least quite similar) problems with
> deepening part (the 'incrementals' transfer part) as you found with my
> first proposal of server bisection / division of rev-list, and serving
> 1/Nth of revisions (where N is selected so packfile is reasonable) to
> client as incrementals. Yours is top-down, mine was bottom-up approach
> to sending series of smaller packs. The problem is how to select size
> of incrementals, and that incrementals are all-or-nothing (but see also
> comment below).

I've defined a way to do this which doesn't have the complexity of
bisect in GitTorrent, making the compromise that you can't guarantee
each chunk is exactly the same size. I'll have a crack at doing it
based on the rev-cache code in C, instead of the horrendously slow
Perl/Berkeley solution I have at the moment, to see how well it fares.

> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same. But that might be
> difficult, or even impossible.

Delta compression is not repeatable enough for this. The first version
of GitTorrent assumed it would be an appropriate solution, and that
assumption didn't hold.

So, first you have to sort the objects. That's fine: --date-order is a
good starting point, and I reasoned that interleaving each commit's new
objects with the commit objects themselves would be a useful sort
order. You also need a tie-break for commits with the same commit
date; I just used the SHA-1 of the commit for that.

Finally, when making the packs, to avoid excessive transfer you have to
try to make sure that they are "thin" packs. Currently, thin packs can
only work by starting at the beginning of history and working forward,
which is the opposite of what happens most of the time in packs. I
think this is the source of much of the inefficiency, mentioned in my
other e-mail, that comes from chopping up the object lists. It might
be possible, if you could also know which earlier objects were using an
object as a delta base, to try delta'ing against all of those objects
and see which one results in the smallest delta.

> Well, we could send the list of objects in pack in order used later by
> pack creation to client (non-resumable but small part), and if packfile
> transport was interrupted in the middle client would compare list of
> complete objects in part of packfile against this manifest, and sent
> request to server with *sorted* list of object it doesn't have yet.
> Server would probably have to check validity of objects list first (the
> object list might be needed to be more than just object list; it might
> need to specify topology of deltas, i.e. which objects are base for which
> ones). Then it would generate rest of packfile.

Mmm. It's a bit chatty, that. Object lists add another 10-20% on top,
which I think should be avoidable: if the thin pack problem were
solved, along with the problem of some objects ending up in more than
one of the thin packs that are created, the overhead should shrink to
very little.

Sam
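
P.S. To make a couple of the ideas above concrete, here are some rough
sketches in C (the direction the real code is headed). Every name in
them is invented for illustration; none of it is actual GitTorrent or
rev-cache code.

First, the chunking compromise. One simple way to cut the stable
object list into roughly equal packs - not necessarily what my
implementation does - is to accumulate objects until a byte target is
passed, which is exactly why chunk sizes can't be guaranteed equal:

#include <stddef.h>

/*
 * Find the end of the next chunk: walk the stable-ordered object
 * list from 'start', accumulating object sizes until the running
 * total reaches 'target' bytes.  A chunk can overshoot the target
 * by up to one object, so chunks are only roughly equal in size.
 */
size_t next_chunk_end(const size_t *obj_size, size_t nr,
                      size_t start, size_t target)
{
        size_t bytes = 0, i = start;

        while (i < nr && bytes < target)
                bytes += obj_size[i++];
        return i;       /* one past the last object in the chunk */
}

A caller would just loop, handing each [start, end) range to the pack
writer, until start reaches nr.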
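
Second, the stable ordering itself: commits sorted by commit date
(newest first, as --date-order produces), tie-broken by the commit's
SHA-1.  This sketch ignores the parent-before-child constraint that
the real traversal also has to respect, and the struct is made up:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct commit_ent {
        unsigned long date;             /* commit timestamp */
        unsigned char sha1[20];         /* commit object name */
};

static int cmp_commit(const void *a_, const void *b_)
{
        const struct commit_ent *a = a_, *b = b_;

        if (a->date != b->date)
                return a->date > b->date ? -1 : 1;  /* newer first */
        return memcmp(a->sha1, b->sha1, 20);  /* SHA-1 tie-break */
}

int main(void)
{
        struct commit_ent c[3];
        int i;

        memset(c, 0, sizeof(c));
        c[0].date = 100; c[0].sha1[0] = 0xaa;
        c[1].date = 200;
        c[2].date = 100; c[2].sha1[0] = 0x01;

        qsort(c, 3, sizeof(c[0]), cmp_commit);

        /* expect 200 first, then the two date-100 commits by SHA-1 */
        for (i = 0; i < 3; i++)
                printf("%lu %02x\n", c[i].date, c[i].sha1[0]);
        return 0;
}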
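
Third, the client side of your resume idea, chatty or not: given the
manifest of object names the server sent up front, and the set of
complete objects recovered from the truncated packfile, compute the
*sorted* list of objects still needed.  Assuming both inputs are
already sorted by SHA-1, it's a single merge-style pass:

#include <stddef.h>
#include <string.h>

/*
 * Walk two sorted lists of 20-byte object names; copy those that
 * appear in 'manifest' but not in 'have' into 'missing'.  Returns
 * the number of missing objects, already in sorted order, ready to
 * be sent back to the server.
 */
size_t diff_manifest(const unsigned char (*manifest)[20], size_t n_manifest,
                     const unsigned char (*have)[20], size_t n_have,
                     unsigned char (*missing)[20])
{
        size_t i = 0, j = 0, k = 0;

        while (i < n_manifest) {
                int cmp = (j < n_have) ?
                        memcmp(manifest[i], have[j], 20) : -1;

                if (cmp < 0)
                        memcpy(missing[k++], manifest[i++], 20);
                else if (cmp > 0)
                        j++;    /* shouldn't happen if 'have' is a
                                   subset of the manifest, but be safe */
                else
                        i++, j++;       /* object already received */
        }
        return k;
}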