Re: Resumable clone/Gittorrent (again)

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Fri, 7 Jan 2011 13:34:31 +0700

On Fri, Jan 7, 2011 at 10:21 AM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> How do you actually define your chain? Given that Git is conceptually
> snapshot based, there is currently no relationship between two blobs
> forming the content for two different versions of the same file. Even
> delta objects are not really part of the Git data model as they are only
> an encoding variation of a given primary object. In fact, we may and
> actually do have deltas where the base object is not from the same
> worktree file as the delta object itself.
>
> The only thing that
> ties this all together is the commit graph. And that graph might have
> multiple forks and merges so any attempt at a linearity representation
> into a chain is rather futile. Therefore it is not clear to me how you
> can define a chain with a beginning and an end, and how this can be
> resumed midway.

It doesn't need to be linear. OK, it's not a chain but a DAG of the
objects that share the same path, with the same shape as the commit DAG.

>> We start by fetching all commit contents reachable from a commit tip.
>
> Sure. This is doable today and is called a shallow clone with depth=1.

I meant only commit objects, no trees or blobs.

>> This is a chain, therefore resumable.
>
> I don't get that part though. How is this resumable? That's the very
> issue we have with a clone.

I assume that all commits are sent in an order such that a commit's
parents always come after the commit itself. We can make a pack of
undeltified commit objects in that order. That way, if the pack cannot
be sent completely to the client, we can still recover a connected
commit DAG starting from the tip.
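
To make the ordering concrete, here is a minimal sketch in Python. It is
not an existing Git API: the "parents" mapping, the toy commit names and
both function names are assumptions made up for illustration. It emits
commits so that each one precedes its parents, then checks that any
truncated prefix of that order is still connected to the tip.

```python
from collections import deque

def children_first_order(parents, tip):
    """Order the commits reachable from tip so each one precedes its parents."""
    # Count, for every reachable commit, how many reachable children it has.
    pending_children = {tip: 0}
    seen, stack = {tip}, [tip]
    while stack:
        commit = stack.pop()
        for parent in parents[commit]:
            pending_children[parent] = pending_children.get(parent, 0) + 1
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    # Kahn's algorithm: emit a commit only after all of its children.
    order, ready = [], deque([tip])
    while ready:
        commit = ready.popleft()
        order.append(commit)
        for parent in parents[commit]:
            pending_children[parent] -= 1
            if pending_children[parent] == 0:
                ready.append(parent)
    return order

def prefix_is_connected(parents, order, n):
    """True if the first n commits of order are all reachable from the tip
    by walking only through commits inside that prefix."""
    have = set(order[:n])
    reached, stack = set(), [order[0]]
    while stack:
        commit = stack.pop()
        if commit in reached or commit not in have:
            continue
        reached.add(commit)
        stack.extend(parents[commit])
    return reached == have

# Toy history (hypothetical): E merges C and D, which both descend from B.
toy_parents = {"E": ["C", "D"], "C": ["B"], "D": ["B"], "B": ["A"], "A": []}
toy_order = children_first_order(toy_parents, "E")
```

With this order, cutting the stream after any number of commits leaves a
set in which every commit still has a path back to the tip, which is the
property the resumption relies on.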

We can then traverse the commit graph we have and request a pack of the
missing commits, growing the commit DAG until we have all commits.
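
A rough sketch of that resume step, again in Python and under stated
assumptions: "received" is a hypothetical mapping from each commit we
already hold to its parent ids, and "fetch_commits" stands in for
whatever request the protocol would end up using. Neither exists in Git
today.

```python
def missing_parents(received):
    """Return parent ids referenced by commits we hold but do not hold ourselves."""
    missing = set()
    for parent_ids in received.values():
        for parent in parent_ids:
            if parent not in received:
                missing.add(parent)
    return sorted(missing)

def resume_until_complete(received, fetch_commits):
    """Keep asking for missing commits until the commit DAG is closed."""
    received = dict(received)
    while True:
        wanted = missing_parents(received)
        if not wanted:
            return received
        fetched = fetch_commits(wanted)  # hypothetical transport call
        if not fetched:
            raise RuntimeError("remote returned none of the requested commits")
        received.update(fetched)

# Toy data (hypothetical): the transfer broke after E and C arrived.
full_history = {"E": ["C", "D"], "C": ["B"], "D": ["B"], "B": ["A"], "A": []}
partial = {"E": ["C", "D"], "C": ["B"]}

def fake_remote(ids):
    # Stand-in for the server: answers with exactly the commits asked for.
    return {commit: full_history[commit] for commit in ids}
```

Calling resume_until_complete(partial, fake_remote) asks first for B and
D, then for A, and stops once no parent is missing.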

> I proposed a solution to that already, which is to use
> git-upload-archive for one of the tip commit since the data stream
> produced by upload-archive (once decompressed) is actually
> deterministic. Once completed, this can be converted into a shallow
> clone on the client side, and can be deepened in smaller steps
> afterwards.

You see, I don't send trees and blobs in this phase. There are three
phases. Phase 1 fetches all commits. Once we have all commits, we can
use them to request packs of trees of the same path. Those packs are
like the commit pack, but deltified. That's phase 2. When we have
enough trees, we can proceed to phase 3: fetching packs of blobs.
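
As an illustration of "objects of the same path", here is a toy Python
model. It is only a sketch: real trees are nested and the result is a
DAG rather than a list, and the "snapshots" mapping, the object ids and
the function name are all invented for the example. It groups the
distinct versions seen at each path while walking commits from the tip.

```python
def path_chains(order, snapshots):
    """Group distinct object ids per path, in the order commits are walked."""
    chains = {}
    for commit in order:
        for path, object_id in snapshots[commit].items():
            versions = chains.setdefault(path, [])
            if object_id not in versions:
                versions.append(object_id)
    return chains

# Toy data (hypothetical): three commits, newest first, two paths.
toy_commits = ["c3", "c2", "c1"]
toy_snapshots = {
    "c3": {"README": "b3", "main.c": "m2"},
    "c2": {"README": "b2", "main.c": "m2"},
    "c1": {"README": "b1", "main.c": "m1"},
}
toy_chains = path_chains(toy_commits, toy_snapshots)
```

Each entry of toy_chains is one unit that could be requested on its own
once the commits (and, for blobs, the trees) are known.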

>> From there each commit can be
>> examined. Missing trees and blobs will be fetched as chains. Every time
>> a delta is received, we can recreate the new object and verify it (we
>> should have its SHA-1 from its parent trees/commits).
>
> What if the delta is based on an object from another chain? How do you
> determine which chain to ask for to get that base?

Chains should be independent. If a chain is based on another chain and
we have not got its base yet (because the other chain is not
completed), we can fetch the base separately. In theory we need to
fetch one version of every path once so that they can serve as bases.
So it's like a broken down version of git-upload-archive.

>> Because these chains are quite independent, in a sense that a blob
>> chain is independent from another blob chain (but requires tree
>> chains, of course). We can fetch as many as we want in parallel, once
>> we're done with the commit chain.
>
> But in practice, most of those chains will end up containing objects
> which are duplicate of objects in another chain. How do you tell the
> remote that you want part of a chain because you've got 96% of it in
> another chain already?

Because all clients should have the full commit graph (without trees and
blobs) before doing anything else, they should be able to specify a rev
list for the chain they need. So if you only need SHA1~76..SHA1~100 of
a path, say so to the remote side. SHA1 must be one of the refs on the
remote side, so it can parse the syntax and quickly verify whether
"SHA1~76" is reachable/allowed to transfer.
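
A sketch of how the remote side might resolve such a request, in Python.
Everything here is an assumption for illustration: the function names,
the "refs" set of advertised tips, and the reading of "TIP~a..TIP~b" as
the span of first-parent generations a through b below one tip. That
reading is how the range is used above; git's own A..B notation instead
means "reachable from B but not from A".

```python
def first_parent_ancestor(parents, tip, n):
    """Follow the first parent n times starting from tip (like tip~n)."""
    commit = tip
    for _ in range(n):
        if not parents[commit]:
            raise ValueError(f"{tip}~{n} does not exist")
        commit = parents[commit][0]
    return commit

def resolve_slice(parents, refs, spec):
    """Resolve 'TIP~a..TIP~b' to the commits TIP~a through TIP~b inclusive."""
    newer, older = spec.split("..")
    tip, start = newer.split("~")
    other_tip, end = older.split("~")
    # Only accept requests anchored at a single advertised ref.
    if tip != other_tip or tip not in refs:
        raise ValueError("range must be anchored at one advertised ref")
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("range must go from newer to older")
    commit = first_parent_ancestor(parents, tip, start)
    result = [commit]
    for _ in range(end - start):
        commit = first_parent_ancestor(parents, commit, 1)
        result.append(commit)
    return result

# Toy linear history (hypothetical): c0 is the tip, c5 the root.
toy_history = {f"c{i}": [f"c{i + 1}"] for i in range(5)}
toy_history["c5"] = []
toy_refs = {"c0"}
toy_slice = resolve_slice(toy_history, toy_refs, "c0~2..c0~4")
```

Because the request is anchored at a ref the remote already advertises,
the check is just a bounded walk from that ref.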

>> The last thing I like about these chains is that the number of chains
>> is reasonable. It won't increase too fast over time (as compared to
>> the number of commits). As such it maps well to BitTorrent's "pieces".
>
> My problem right now is that I don't see how this maps well to Git.

Git sees a repository as a history of snapshots. I see it this way as a
bunch of "git log -- path", which is not that bad.
-- 
Duy