Re: Resumable clone/Gittorrent (again)

On Thu, Jan 6, 2011 at 1:47 AM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Thu, Jan 6, 2011 at 1:07 AM, Luke Kenneth Casson Leighton
> <luke.leighton@xxxxxxxxx> wrote:
>> the plan is to turn that variation in the git pack-objects responses,
>> across multiple peers, into an *advantage* not a liability.  how?
>> like this:
>>
>>  * a client requiring objects from commit abcd0123 up to commit
>> efga3456 sends out a DHT broadcast query to all and sundry who have
>> commits abcd0123 and everything in between up to efga3456.
>>
>>  * those clients that can be bothered to respond, do so [refinements below]
>>
>>  * the requestor selects a few of them, and asks them to create git
>> pack-objects.  this takes time, but that's ok.  once created, the size
>> of the git pack-object is sent as part of the acknowledgement.
>>
>>  * the requestor, on receipt of all the sizes, selects the *smallest*
>> one to begin the p2p (.torrent) from (by asking the remote client to
>> create a .torrent specifically for that purpose, with the filename
>> abcd0123-efga3456).
>
> That defeats the purpose of distributing. You are putting pressure on
> certain peers.

 that's unavoidable, but it's not actually as bad as it seems.  think
about it.  normally, "pressure" is put onto a git server, by forcing
that server to perform multiple "git pack-object" calculations,
repeatedly, for each and every "git pull".

 so, the principle behind this RFC (is it an RFC? yes, kinda...) is
that a) you cache those git pack-objects, thus avoiding heavy CPU
usage b) you make the requests to _many_ peers, which you'll likely
find are already in the process of distributing that particular
commit-range _anyway_, so will _definitely_ have it  ... etc. etc.

 so there's a ton of reasons why it's quite a big improvement over the
present star-network arrangement.
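 the four quoted steps above can be sketched roughly like this (a
minimal python sketch; `dht`, `find_peers`, `build_pack` and
`make_torrent` are all hypothetical names standing in for the real
machinery, not any existing API):

```python
def pick_smallest(pack_sizes):
    """pack_sizes maps peer -> size of its pack for this commit range."""
    return min(pack_sizes, key=pack_sizes.get)

def fetch_range(dht, have, want, max_peers=3):
    peers = dht.find_peers(have, want)    # the DHT broadcast query
    # ask a handful of responders to create a pack and report its size
    sizes = {p: p.build_pack(have, want) for p in peers[:max_peers]}
    best = pick_smallest(sizes)           # smallest pack wins
    return best.make_torrent(have, want)  # then start the torrent from it
```

 note the only real decision logic is the `min()` over the reported
pack sizes; everything else is plumbing.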

>
>> now, an immediately obvious refinement of this is that those .torrent
>> (pack-objects) "stick around", in a cache (with a hard limit defined
>> on the cache size of course).  and so, when the client that requires a
>> pack-object makes the request, of course, those remote clients that
>> *already* have that cached pack-object for that specific commit-range
>> should be given first priority, to avoid other clients from having to
>> make massive amounts of git pack-objects.
>
> Cache has its limits too. Suppose I half-fetch a pack, then stop and
> go wild for a month. The next month I restart the fetch, and the pack
> may no longer be in cache. A new pack may or may not be identical to
> the old pack.

 correct.  that's not in the slightest bit a problem.  the peer which
has that new pack will be asked to make a new .torrent for _that_
pack.  with a new name that uniquely identifies it (the md5sum of the
pack would do as the .torrent filename)
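 a minimal sketch of that naming scheme, plus the hard-limited cache
mentioned earlier (`PackCache` and its eviction policy are assumptions
for illustration, not part of any proposal text):

```python
import hashlib
from collections import OrderedDict

def torrent_name(pack_bytes):
    # name the .torrent after the md5sum of the pack, so a regenerated
    # (possibly different) pack for the same commit-range gets a new name
    return hashlib.md5(pack_bytes).hexdigest() + ".torrent"

class PackCache:
    """cached packs with a hard limit on total size (oldest evicted first)."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.entries = OrderedDict()  # torrent name -> pack bytes

    def add(self, pack_bytes):
        name = torrent_name(pack_bytes)
        self.entries[name] = pack_bytes
        self.entries.move_to_end(name)
        # evict the oldest packs once over the hard limit
        while sum(map(len, self.entries.values())) > self.max_bytes:
            self.entries.popitem(last=False)
        return name
```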

> Also if you go with packs, you are tied to the peer that generates
> that pack. Two different peers can, in theory, generate different
> packs (in encoding) for the same input.

 yes.  correct.  i _did_ say that you pick the one that is the
smallest of the two (or three.  or 10).  in this way you actually do
much better than you would otherwise in a "star network" such as a
standard HTTP git server, because you've asked 2, 3 or 10 (whatever)
peers, and you'll end up with _the_ most efficient representation of
that commit-range.  statistically speaking, of course :)


> Another thing with packs (ok, not exactly with packs) is how you
> verify that you have got what you asked for.

 ok - how do you verify that you've got what you asked for, when you
fetch from a git server using HTTP?

> Bittorrent can verify every
> piece a peer receives because sha-1 sum of those pieces are recorded
> in .torrent file.

 yes.  this is simply a part of the bittorrent protocol, to ensure
that the file being transferred is correctly transferred.

 these verification steps should be _trusted_ and should _not_ be
confused with anything else (i've deleted the rest of the paragraph
you wrote, in order to reduce any opportunity for confusion).
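 for reference, the bittorrent-side check looks roughly like this (a
sketch: the 256 KiB piece size is just a typical choice, and a real
.torrent stores the hashes concatenated in its `pieces` field):

```python
import hashlib

PIECE_SIZE = 256 * 1024  # an assumed, typical piece size

def piece_hashes(data):
    # the .torrent records one sha-1 per fixed-size piece of the file
    return [hashlib.sha1(data[i:i + PIECE_SIZE]).digest()
            for i in range(0, len(data), PIECE_SIZE)]

def verify_piece(index, piece, expected):
    # a receiving peer checks each piece against the recorded hash
    return hashlib.sha1(piece).digest() == expected[index]
```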

 if you mentally keep git separate from bittorrent it helps.  imagine
that bittorrent is merely a drop-in replacement for git over HTTP
(nicolas kindly explained the plugin system for git which would add
another protocol for downloading git repos, and yes this can all be
implemented as a plugin)
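 as a sketch of what such a plugin looks like: git's remote-helper
mechanism runs `git-remote-<scheme>` as a child process and speaks a
simple line protocol over stdin/stdout, so a torrent transport would be
a helper whose fetch step does the bittorrent download (left as a
placeholder below; this is a skeleton, not a working transport):

```python
import sys

def remote_helper(stdin=sys.stdin, stdout=sys.stdout):
    # git drives the helper with one command per line; "capabilities"
    # asks what we support, "fetch <sha1> <name>" asks for objects
    for line in stdin:
        cmd = line.strip()
        if cmd == "capabilities":
            stdout.write("fetch\n\n")  # declare: we can fetch objects
        elif cmd.startswith("fetch "):
            pass  # placeholder: the bittorrent download would go here
        elif cmd == "":
            break  # a blank line ends the conversation
    stdout.flush()
```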


>> so, can you see that a) this is a far cry from the "simplistic
>> transfer of blobs and trees" b) it's *not* going to overload peoples'
>> systems by splattering (eek!) millions of md5 sums across the internet
>> as bittorrent files c) it _does_ fit neatly into the bittorrent
>> protocol d) it combines the best of git with the best of p2p
>> distributed networking principles...
>
> How can you advertise what you have to another peer?

 you don't.  it's done "on-demand".

 the concept of "git push" becomes virtually a null-op, updating the
bittorrent tracker and that's... about it.

 it's when "git pull" happens that all the work is done, starting with
that DHT query [no, i know "bittorrent the protocol" doesn't have DHT,
but many bittorrent clients _do_ have DHT, and Tribler has an
extremely good one].
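 the push/pull asymmetry can be summarised in a few lines (the tracker
and the DHT are plain dicts here, hypothetical stand-ins for the real
services):

```python
def push(tracker, repo, new_tip):
    # "push" is nearly a null-op: just advertise the new tip
    tracker[repo] = new_tip

def pull(tracker, dht, repo, have):
    want = tracker[repo]               # learn the current tip
    peers = dht.get((have, want), [])  # DHT query: who can serve have..want?
    return want, peers
```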

 l.

