Re: Resumable clone/Gittorrent (again)

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

> I've been analyzing bittorrent protocol and come up with this. The
> last idea about a similar thing [1], gittorrent, was given by Nicolas.
> This keeps close to that idea (i.e the transfer protocol must be around git
> objects, not file chunks) with a bit difference.
>
> The idea is to transfer a chain of objects (trees or blobs), including
> base object and delta chain. Objects are chained in according to
> worktree layout, e.g. all objects of path/to/any/blob will form a
> chain, from a commit tip down to the root commits. Chains can have
> gaps, and don't need to start from commit tip. The transfer is
> resumable because if a delta chain is corrupt at some point, we can
> just request another chain from where it stops. Base object is
> obviously resumable.

I may be talking nonsense, please bear with me.

I'm not sure this works well, since chains defined this way change over time. 
I may request commits A and B while declaring that I possess commits C and D. 
One server may be ahead of A; should it send me more data, or repack the chain 
so that the non-requested versions are excluded? At the same time, the server 
may be missing B and possess only some of its ancestors. Should it send me 
only part of the chain, or would I be better off asking a different server?

Moreover, in case a directory gets renamed, its content may get transferred 
needlessly. This is probably not a big problem.

I haven't read the whole other thread yet, but what about going the other way 
round? Use a single commit as a chain: create deltas assuming that all 
ancestors are already available. The packs may arrive out of order, so 
decompression may have to wait. The number of commits may be an order of 
magnitude larger than the number of paths (there are currently 2254 paths 
and 24235 commits in git.git), so grouping consecutive commits into one larger 
pack may be useful.
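As a rough sketch of the per-commit packing idea using standard git plumbing 
(`pack-tip` is just a hypothetical pack name; this is one way to express it, 
not a proposed implementation):

```shell
# List every object reachable from the tip commit but not from its
# first parent, then pack exactly those objects.  The object list
# depends only on these two commits, so the resulting pack is stable
# over time and could be cached forever.
git rev-list --objects HEAD ^HEAD^ | git pack-objects pack-tip
```

Note that deltas against objects outside the pack (the "all ancestors are 
already available" assumption) would additionally require a thin pack, which 
git normally only produces for streaming transfers.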

The advantage is that the packs stay stable over time: you may create them 
using the most aggressive and time-consuming settings and store them forever. 
You could create packs for single commits, packs for non-overlapping 
consecutive pairs of them, for non-overlapping pairs of pairs, and so on. I 
mean, with commits numbered 0, 1, 2, ..., create packs [0,1], [2,3], ..., 
then [0,3], [4,7], etc. The reason for this is obviously to allow reading 
groups of commits from different servers so that they fit together (similar 
to buddy memory allocation). Of course, there are things like branches 
bringing chaos into this simple scheme, but I'm sure this can be solved 
somehow.
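To make the buddy-style grouping concrete, here is a small sketch (the helper 
name is hypothetical, and it assumes commits have a stable linear numbering, 
which branches complicate as noted above): precomputed packs cover aligned 
power-of-two ranges, and any requested range is covered greedily by the 
largest aligned packs that fit.

```python
def buddy_cover(lo, hi):
    """Cover the inclusive commit range [lo, hi] with the fewest
    buddy-aligned packs, i.e. ranges [i, i + 2**k - 1] where i is a
    multiple of 2**k.  These are exactly the packs a server would
    precompute: [0,1], [2,3], ..., [0,3], [4,7], ...
    """
    packs = []
    i = lo
    while i <= hi:
        # Double the pack size while it stays aligned at i and
        # does not overshoot the requested range.
        size = 1
        while i % (size * 2) == 0 and i + size * 2 - 1 <= hi:
            size *= 2
        packs.append((i, i + size - 1))
        i += size
    return packs

print(buddy_cover(0, 7))   # -> [(0, 7)]: one precomputed pack of 8 commits
print(buddy_cover(2, 8))   # -> [(2, 3), (4, 7), (8, 8)]
```

Since each aligned range maps to exactly one stable pack, ranges fetched from 
different servers fit together without overlap, as in buddy memory allocation.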

Another problem is the client requesting commits A and B while declaring that 
it possesses commits C and D. When both C and D are ancestors of either A or 
B, you can ignore this (as you assume it while packing anyway). The other case 
is less probable, unless e.g. C is master and A is a development branch. 
Currently I have no idea how to optimize this, or whether it could be 
important.

I see no disadvantage when compared to path-based chains, but am probably 
overlooking something obvious.

