Re: Resumable clone/Gittorrent (again)

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Thu, 6 Jan 2011 08:32:12 +0700

On Thu, Jan 6, 2011 at 6:28 AM, Maaartin <grajcar1@xxxxxxxxx> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>> I've been analyzing the bittorrent protocol and have come up with this.
>> The last idea about a similar thing [1], gittorrent, was given by Nicolas.
>> This keeps close to that idea (i.e. the transfer protocol must be around
>> git objects, not file chunks) with a slight difference.
>>
>> The idea is to transfer a chain of objects (trees or blobs), including
>> base object and delta chain. Objects are chained according to
>> worktree layout, e.g. all objects of path/to/any/blob will form a
>> chain, from a commit tip down to the root commits. Chains can have
>> gaps, and don't need to start from commit tip. The transfer is
>> resumable because if a delta chain is corrupt at some point, we can
>> just request another chain from where it stops. Base object is
>> obviously resumable.
>
> I may be talking nonsense, please bear with me.
>
> I'm not sure if it works well, since chains defined this way change over time.
> I may request commits A and B while declaring to possess commits C and D. One
> server may be ahead of A, so should it send me more data or repack the chain so
> that the non-requested versions get excluded? At the same time the server may
> be missing B and possess only some ancestors of it. Should it send me only a
> part of the chain or should I better ask a different server?

I'll keep it simple. A chain is defined by one commit head, so such a
chain can't change over time. You can still ask for just part of the
chain, and rev-list syntax can be used for that. For example, if you
already have commits C and D and there are 10 deltas in the chain
(linear history for simplicity here), requesting "give me A~10 ^C ^D"
should give the required commits.
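
To make the partial-chain request concrete, here is a rough Python
sketch of the selection rule only (not of any wire protocol). It
assumes a toy commit graph; the commit names c0..c12, the `parents`
mapping and the helper names are made up for illustration. The rule is
the usual rev-list one: everything reachable from the wanted tips,
minus everything reachable from what the client already has.

```python
# Toy model of "tip ^have ^have" selection over a commit graph.
# Commit names and the `parents` mapping are hypothetical.

def reachable(parents, starts):
    """Return every commit reachable from `starts` (inclusive)."""
    seen = set()
    stack = list(starts)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen

def chain_range(parents, wants, haves):
    """Commits reachable from `wants` but not from `haves`."""
    return reachable(parents, wants) - reachable(parents, haves)

# Linear history c0 <- c1 <- ... <- c12; the client already has c2.
parents = {f"c{i}": [f"c{i - 1}"] for i in range(1, 13)}
parents["c0"] = []

missing = chain_range(parents, wants=["c12"], haves=["c2"])
print(sorted(missing, key=lambda name: int(name[1:])))
```

Because the selection depends only on the graph and the want/have
sets, any server holding the same commits would pick the same part of
the chain.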

> Moreover, in case a directory gets renamed, the content may get transferred
> needlessly. This is probably no big problem.

Yes, the chain constraint can backfire in these cases. If the server
can recognize them, we can mix standard upload-pack/fetch-pack with
this scheme by cutting the commit history into chunks. The chunks that
contain a directory rename can then be fetched with git-fetch.

> I haven't read the whole other thread yet, but what about going the other way
> round? Use a single commit as a chain, create deltas assuming that all
> ancestors are already available. The packs may arrive out of order, so the
> decompression may have to wait. The number of commits may be one order of
> magnitude larger than the number of paths (there are currently 2254 paths
> and 24235 commits in git.git), so grouping consecutive commits into one larger
> pack may be useful.

The number of commits can increase fast. I'd rather have a number that
stays small and stable over time. Commits also depend on other commits,
so you can't verify a commit until you have got all of its parents. The
same applies within a file chain, but one file chain does not interfere
with other file chains.
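
As a rough illustration of why the number of chains follows the number
of paths rather than the number of commits, here is a small Python
sketch. It assumes a linear history given tip first, where each commit
is just a mapping from path to blob id; the paths, ids and the
`path_chains` helper are invented for the example and say nothing
about how a real implementation would walk trees.

```python
# Toy model: build one chain per path from a linear commit history.
# Commits are listed tip first; each maps path -> blob id (all made up).

def path_chains(history):
    """For each path, list its distinct blob ids from tip down to root."""
    chains = {}
    for snapshot in history:
        for path, blob in snapshot.items():
            chain = chains.setdefault(path, [])
            # Only record a blob when the path's content actually changed.
            if not chain or chain[-1] != blob:
                chain.append(blob)
    return chains

history = [
    {"Makefile": "m2", "main.c": "a3"},  # tip
    {"Makefile": "m2", "main.c": "a2"},
    {"Makefile": "m1", "main.c": "a1"},
    {"main.c": "a1"},                    # root
]

chains = path_chains(history)
for path, chain in sorted(chains.items()):
    print(path, chain)
```

Four commits give two chains here, and adding more commits that touch
the same paths only makes the existing chains longer. Each chain can
be fetched and checked without looking at the other one.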

> The advantage is that the packs stay stable over time, you may create them
> using the most aggressive and time-consuming settings and store them forever.
> You could create packs for single commits, packs for non-overlapping
> consecutive pairs of them, for non-overlapping pairs of pairs, etc. I mean with
> commits numbered 0, 1, 2, ... create packs [0,1], [2,3], ..., [0,3], [4,7],
> etc. The reason for this is obviously to allow reading groups of commits from
> different servers so that they fit together (similar to Buddy memory
> allocation). Of course, there are things like branches bringing chaos in this
> simple scheme, but I'm sure this can be solved somehow.

Pack encoding can change. And packs can contain objects you don't want
to share (i.e. hidden from public view).
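
For reference, the aligned grouping described in the quoted text
([0,1], [2,3], ..., [0,3], [4,7]) could be enumerated roughly as
below. This is only a sketch and assumes commits can be numbered
linearly 0, 1, 2, ..., which branches break, as already noted. The
function names are hypothetical.

```python
# Buddy-style grouping of linearly numbered commits (hypothetical helpers).

def buddy_ranges(n_commits, level):
    """Aligned, non-overlapping ranges of 2**level commits (inclusive ends).
    A trailing partial group is left out."""
    size = 2 ** level
    return [(start, start + size - 1)
            for start in range(0, n_commits - size + 1, size)]

def cover(lo, hi):
    """Greedily cover commits lo..hi (inclusive) with aligned ranges."""
    out = []
    while lo <= hi:
        size = 1
        # Grow the block while it stays aligned and inside the request.
        while lo % (size * 2) == 0 and lo + size * 2 - 1 <= hi:
            size *= 2
        out.append((lo, lo + size - 1))
        lo += size
    return out

print(buddy_ranges(8, 1))
print(buddy_ranges(8, 2))
print(cover(2, 7))
```

`cover` shows how a client could split a request into pieces that
every server would have packed identically, which is the point of the
alignment.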

> Another problem is the client requesting commits A and B while declaring to
> possess commits C and D. When both C and D are ancestors of either A or B, you
> can ignore it (as you assume this while packing, anyway). The other case is
> less probable, unless e.g. C is the master and A is a developing branch.
> Currently, I've no idea how to optimize this and whether this could be
> important.

As I said, we can request just part of a chain (from A+B to C+D).
git-fetch should be used if the repo is already fairly up to date,
though. It's just more efficient.
-- 
Duy