Re: Resumable git clone?

On 3/2/16 2:02 PM, Jeff King wrote:
> On Wed, Mar 02, 2016 at 03:22:17PM +0700, Duy Nguyen wrote:

>>> As a simple proposal, the server could send the list of hashes (in
>>> approximately the same order it would send the pack), the client could
>>> send back a bitmap where '0' means "send it" and '1' means "got that one
>>> already", and the client could compress that bitmap.  That gives you the
>>> RLE and similar without having to write it yourself.  That might not be
>>> optimal, but it would likely set a high bar with minimal effort.
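[For concreteness, a minimal sketch of that exchange, assuming the helper
names and wire format below are invented for illustration -- none of this
is existing git code:]

    import zlib

    def client_reply(hashes, have):
        # One bit per hash, in the server's order:
        # 1 = "got that one already", 0 = "send it".
        bits = bytearray((len(hashes) + 7) // 8)
        for i, h in enumerate(hashes):
            if h in have:
                bits[i // 8] |= 1 << (i % 8)
        # A general-purpose compressor turns the long runs of 0s and 1s
        # into something RLE-like without a hand-rolled encoding.
        return zlib.compress(bytes(bits))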

>> We have an implementation of EWAH bitmap compression, so compressing
>> is not a problem.
>>
>> But I still don't see why it's more efficient to have the server send
>> the hash list to the client. Assume you need to transfer N objects.
>> That direction makes you always send N hashes. But if the client sends
>> the list of already fetched objects, M, then M <= N. And we won't need
>> to send the bitmap. What did I miss?

> Right, I don't see what the point is in compressing the bitmap. The sha1
> list for a clone of linux.git is 87 megabytes. The return bitmap, even
> naively, is 500K. Unless you are trying to optimize for wildly
> asymmetric links.
>
> If the client just naively sends "here's what I have", then we know it
> can never be _more_ than 87 megabytes. And as a bonus, the longer the
> list is, the more we are saving (so at the moment you are sending 82MB,
> it's really worth it, because you do have 95% of the pack, which is
> worth amortizing).
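[To make those numbers concrete, assuming 20-byte binary sha1s and
deriving the object count from the 87MB figure above:]

    objects = 87 * 1024 * 1024 // 20    # 87MB of sha1s -> ~4.56M objects
    bitmap_kb = objects / 8 / 1024      # one bit per object
    print(objects, bitmap_kb)           # 4561305 objects, ~557KB: the "500K" above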

> I'm still a little dubious that anything involving "send all the hashes"
> is going to be useful in practice, especially for something like the
> kernel (where you have tons of small objects that delta well). It
> would work better when you have gigantic objects that don't delta (so
> the cost of a sha1 versus the object size is way better), but then I
> think we'd do better to transfer all of the normal-sized bits up front,
> and then allow fetching the large stuff separately.

> -Peff



What if we had something like an object-lookup-db: for each object, store
its SHA-1, its type, its parent (if any), and its size -- in effect the
entire hierarchy tree, but without the data itself (commit messages, tag
names, and so on)? This would admittedly duplicate some existing
information.

At initial clone time the server sends the object-lookup-db to the client;
the client then reads it and requests objects from the server by SHA-1,
possibly fetching several in parallel. This may not be transfer-efficient,
but it is resumable: the client always knows what has been synced, what
remains, and which SHA-1 refers to which object type.
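A rough sketch of the client side, to illustrate the idea (every name here
is hypothetical -- there is no such interface in git today):

    # Hypothetical resume loop over an "object-lookup-db"; load_lookup_db()
    # and fetch_object() are stand-ins for whatever the real mechanism
    # would be.

    def load_lookup_db(path):
        # One "sha1 type size" entry per line.
        with open(path) as f:
            return [tuple(line.split()) for line in f]

    def fetch_object(sha1):
        pass  # stand-in for fetching one object from the server

    def resume_clone(db_path, log_path):
        try:
            done = set(open(log_path).read().split())
        except FileNotFoundError:
            done = set()
        for sha1, obj_type, size in load_lookup_db(db_path):
            if sha1 in done:
                continue              # already synced; skip on resume
            fetch_object(sha1)        # independent fetches could run in parallel
            with open(log_path, "a") as log:
                log.write(sha1 + "\n")    # durable record of progress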



