On 09/04/10 17:56, Bernhard R. Link wrote:
> While your approach looks like it could work with one commonly
> looked-for branch/tag, it might be worthwhile to look at some more
> complicated cases and see if the protocol can be extended to also
> support those.
>
> Assume you are looking to update a branch B built on top of some,
> say, arm-specific branch A based on top of the main-line kernel L.
>
> There you have 6 types of peers:
>
> 1) those that have the current branch B
> 2) those that have an older state of B
> 3) those that have the current branch A
> 4) those that have an older state of A
> 5) those that have the current branch L
> 6) those that have an older state of L.
>
> Assuming you want a quite obscure branch, type 1 and 2 peers will not
> be that many, so it would be nice if there was some way to also get
> stuff from the others.
>
> Peers of type 6 do not interest you, as there will be enough of type 5.
>
> But already peers of type 3 might not be many, and even fewer of
> type 1, so types 2 and 4 get interesting.
>
> If you first get a tree of all commit-ids you still miss (you only
> need that information once, and every peer of type 1 can give it to
> you) and have some way to look at the heads of each peer, it should be
> straightforward to split the needed objects your way (not specifying
> the head you have but the one you hope to get from others); they
> should be able to send you a partial pack in the way you describe.

I doubt git-p2p would work well with any "quite obscure branch"; obviously, the p2p approach works best for popular content.

But there's the case of _new_ content that has not yet propagated through the swarm, and this was the very reason I chose to have 'ref' in the protocol.

Let's say Linus releases a new kernel, I want to fetch it, and I find 200 peers of which only three have already updated.

Here another field in the initial UDP protocol comes in, one that I omitted from the original description to keep things simple (it's just an optimization). The UDP response from a peer, in addition to the latest commit_id that it has, also contains an integer "commits_ahead" that says how many commits ahead of _my_ id (the one I sent in the request) this peer is.

So in the situation above I immediately find out that, e.g., I'm missing 2000 commits. It would be extremely stupid to try to update using just the three seeds, of course. But by looking at the commits_ahead values of all peers I can see (make an educated guess, really) that 150 of them already have 1500 of the new commits.

So what I do is pick one of the IDs given by one of the peers, such that a sufficient number of other sources are likely to already have it (i.e. every peer reporting a commits_ahead at least as high as that peer's should already have that commit, as long as the history is linear). Then I start fetching that commit from the 150 sources, instead of my real target commit.

Once I'm done with this, I'll repeat the process; by then hopefully more peers will have updated. If the target commit is still rare, I can again look for an intermediate commit that has a sufficient number of seeds.

No extra traffic, and it also discourages abuse, as leeching from just the few seeds will likely result in an overall slower download. This works best if the ref isn't rewound, obviously, but I think that would be the common case.

Does this address at least part of your concerns above? The 'obscure branch' case I'm not sure is easily solvable; I don't know if having a lot of rarely used content widely distributed in the swarm would be a great idea...
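To make the selection step concrete, here is a minimal sketch of the intermediate-commit heuristic in Python. It is not actual git-p2p code; PeerInfo, pick_target and MIN_SEEDS are made-up names, and it assumes the ref is never rewound (so a peer that is N commits ahead has every commit up to N ahead of us):

    from collections import namedtuple

    # One UDP response: the peer's address, the newest commit it has
    # on the ref, and how many commits ahead of *our* id it claims to be.
    PeerInfo = namedtuple("PeerInfo", ["addr", "commit_id", "commits_ahead"])

    MIN_SEEDS = 50  # assumed threshold for "a sufficient number of sources"

    def pick_target(responses, min_seeds=MIN_SEEDS):
        # Ignore peers that are not ahead of us at all.
        ahead = [p for p in responses if p.commits_ahead > 0]
        if not ahead:
            return None, []
        # Most advanced candidates first.
        candidates = sorted(ahead, key=lambda p: p.commits_ahead,
                            reverse=True)
        for i, cand in enumerate(candidates):
            # Peers at indices 0..i report commits_ahead >= cand's, so on
            # a linear (non-rewound) history they all have cand.commit_id.
            if i + 1 >= min_seeds:
                return cand.commit_id, [p.addr for p in candidates[:i + 1]]
        # Too few peers overall: fall back to the least advanced
        # candidate, which every peer that is ahead of us can seed.
        return candidates[-1].commit_id, [p.addr for p in candidates]

In the 200-peer example above, this would return a commit roughly 1500 ahead of mine together with the ~150 addresses likely able to seed it; after fetching it, I re-query and repeat until commits_ahead reaches zero.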
One other interesting case is reusing objects needed by (large) merges that may already be available in the cloud. I'm not sure it happens often enough to care, though...

artur