Re: Resumable clone/Gittorrent (again)

On Wed, Jan 5, 2011 at 5:13 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
> Luke Kenneth Casson Leighton wrote:
>> now that of course leaves you with the problem that you now have
>> potentially hundreds if not thousands or tens of thousands of
>> .torrents to deal with, publish, find etc. etc.
>
> Umm, I'm counting 202400 objects in my git.git and 1799525 in a clone
> of linux-2.6.git.  So I'm not sure how far you want to split things
> into single transfers, but going all the way down to objects will
> massively hurt performance.

 yeah... this is a key reason why i came up with a protocol which
transferred the exact same pack-objects that HTTP and all the other
"point-to-point" git protocols use, to such good effect.

the problem was that i was going to rely on multiple clients being
able to generate the exact same pack-object, given the exact same
input, and then share that pack-object.  ok, that's not the problem,
that was just the plan :)

 nicolas kindly pointed out, at some length, that in a distributed
environment that plan was naive, because whenever you request a
pack-object - e.g. normally over HTTP or another git point-to-point
protocol - it's generated there-and-then using heuristics and
multi-threading that pretty much guarantee that even if you were to
make the exact same request of exactly the same system, you'd get
*different* pack-objects!  not to mention the fact that different
people have the same commits stored in *different* ways: the object
stores, despite containing the same commits, were pulled at different
times and so end up with completely different sets of git objects
representing those exact same commits.

that's all a bit wordy, but you get the idea.

 so, nicolas recommended a "simpler" approach, which, well, apologies
nicolas but i didn't really like it - it seemed far too simplistic and
i'm not really one for spending time doing these kinds of
"intermediate baby steps" (wrong choice of words, no offense implied,
but i'm sure you know what i mean).  i much prefer to just hit all the
issues head-on, right from the start :)


so, in the intervening time since this was last discussed i've given
the pack-objects-distributing idea some thought (and NO, nicolas, just
to clarify, this is NOT grabbing the packed git objects that are
actually in the .git/objects store, so NO, this does NOT end up
bypassing security by giving people objects from another branch; it
really IS getting that lovely varying data which is heuristic-,
store- and thread-count-dependent!).

 the plan is to turn that variation in git pack-object responses
across multiple peers into an *advantage*, not a liability.  how?
like this:

 * a client requiring objects from commit abcd0123 up to commit
efga3456 sends out a DHT broadcast query to all and sundry who have
commits abcd0123 and everything in between up to efga3456.

 * those clients that can be bothered to respond, do so [refinements below]

 * the requestor selects a few of them, and asks them to create git
pack-objects.  this takes time, but that's ok.  once created, the size
of the git pack-object is sent as part of the acknowledgement.

 * the requestor, on receipt of all the sizes, selects the *smallest*
one to begin the p2p (.torrent) transfer from (by asking that remote
client to create a .torrent specifically for the purpose, with the
filename abcd0123-efga3456).

 in this way you end up with not only an efficient git pack-object but
you get, to 99.5% certainty *THE* most efficient git pack-object.
distributed computing at its best :)
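
 to make that flow concrete, here's a rough python sketch of the
requestor's side of the exchange.  dht_broadcast() and the peer
methods are purely hypothetical placeholders for whatever DHT /
peer-wire API gittorrent ends up with - the pick-the-smallest logic
is the only point being made:

def fetch_pack(have_commit, want_commit, max_builders=3):
    # ask the DHT which peers claim to hold everything between
    # have_commit and want_commit
    peers = dht_broadcast(have=have_commit, want=want_commit)

    # pick a handful of volunteers and ask each to generate a
    # pack-object; each acknowledgement carries the resulting size
    builders = peers[:max_builders]
    sizes = {p: p.build_pack(have_commit, want_commit) for p in builders}

    # the smallest pack wins: that peer is asked to publish a .torrent
    # named after the commit range, and the swarm download starts there
    best = min(sizes, key=sizes.get)
    return best.make_torrent(name="%s-%s" % (have_commit, want_commit))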

 now, an immediately obvious refinement of this is that those .torrent
(pack-objects) "stick around" in a cache (with a hard limit on the
cache size, of course).  so, when a client requests a pack-object,
those remote clients that *already* have a cached pack-object for that
specific commit-range should be given first priority, to save everyone
else from having to generate huge numbers of git pack-objects.
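
 purely as an illustration (none of these names are real), the cache
could be as simple as a size-bounded dict keyed by commit-range,
evicting the oldest packs once a hard byte limit is hit:

from collections import OrderedDict

class PackCache:
    def __init__(self, max_bytes=2 * 1024**3):    # e.g. a 2 GiB hard limit
        self.max_bytes = max_bytes
        self.used = 0
        self.packs = OrderedDict()                # (have, want) -> pack bytes

    def add(self, have, want, pack_data):
        if (have, want) in self.packs:
            return                                # already cached
        self.packs[(have, want)] = pack_data
        self.used += len(pack_data)
        # evict the oldest cached packs until we fit under the hard limit
        while self.used > self.max_bytes:
            _, evicted = self.packs.popitem(last=False)
            self.used -= len(evicted)

    def has(self, have, want):
        # a peer answering a DHT query checks this first: if it already
        # holds the pack for that exact range it can answer immediately,
        # size known, no pack generation needed - hence first priority
        return (have, want) in self.packs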

 a further refinement is of course to collect statistics on the number
of peers downloading at any given moment, prioritising those
pack-objects which are being most widely distributed at the time.
this has fairly obvious benefits :)

 yet *another* refinement is slightly less obvious, and it's this:
there *COULD* happen to be some existing pack-objects in the cache,
not for the range abcd0123-efga3456 itself but forming a ready-made
"chain": commits abcd0123-beef7890 already packed and in the cache,
and commits beef7890-efga3456 likewise.  again: the requestor should
be informed of these, and can make up their own mind what to do.
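
 the "ready-made chain" lookup is cheap, too.  again just a sketch,
with commit ids as plain strings and cached_ranges standing in for
whatever the cache actually advertises: treat each cached pack-object
as an edge from its start commit to its end commit and do a
breadth-first search for a path covering the requested range:

from collections import deque

def find_chain(start, end, cached_ranges):
    # cached_ranges: iterable of (have, want) pairs already in the cache.
    # returns a list of ranges whose concatenation covers start..end,
    # or None if no complete chain exists.
    edges = {}
    for have, want in cached_ranges:
        edges.setdefault(have, []).append(want)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        commit, path = queue.popleft()
        if commit == end:
            return path
        for nxt in edges.get(commit, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(commit, nxt)]))
    return None   # no full chain - fall back to generating a fresh pack

# find_chain("abcd0123", "efga3456",
#            [("abcd0123", "beef7890"), ("beef7890", "efga3456")])
# -> [("abcd0123", "beef7890"), ("beef7890", "efga3456")]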

 it gets rather more complex when you have *part* of the chain already
pre-cached (and have to work out err, i got this bit and this bit, but
i'd have to generate a git pack-object for the bit in the middle, i'll
inform the requestor of this, they can make up their mind what to do),
but again i do not imagine for one second that this would be anything
more than an intriguing coding challenge and, importantly, an
optimisation challenge for gittorrent version 3.0 somewhere down the
line, rather than an all-out absolute requirement that it must, must
be done now, now, now.

 what else can i mention that occurred to me... yeah - abandoning a
download.  if, for some reason, it becomes blindingly obvious that
the p2p transfer just isn't working out, then the requestor simply
stops the process and starts again.  a refinement of this, which is a
bit cheeky i know, is to keep *two* simultaneous requests and
downloads for the *exact* same git pack-object commit-chain but with
different data from different groups of peers, for a short period of
time, and then abandon one of them once it's clear which one is best.
this does seem a bit cheeky, but it has the advantage that if the one
that _was_ fastest goes tits-up, you can at least go back to the
previous one and, assuming that the cache hasn't been cleared, just
join in again.  but this is _really_ something that's wayyy down the
line, for gittorrent version 4.0 or 5.0 or so.

so, can you see that a) this is a far cry from the "simplistic
transfer of blobs and trees" b) it's *not* going to overload people's
systems by splattering (eek!) millions of sha1 hashes across the
internet as bittorrent files c) it _does_ fit neatly into the
bittorrent protocol d) it combines the best of git with the best of
p2p distributed networking principles...

... all of which creates a system which people will _still_ say is a
"hammer looking for nails" :)

... right up until the point where some idiot in the USA government
decides to seize sourceforge.net, github.com, gitorious.org and
savannah.gnu.org because they contain source code of software that
MIGHT be used for copyright infringement.  whilst i realise that the
only one of those that might be missed is sourceforget, you cannot
ignore the fact that the trust placed in governments and large
corporations to run the internet infrastructure is now completely
gone, and that the USA and other countries are now putting in place
hypocritical policies that put them into the same category that used
to be reserved for China, Saudi Arabia, Iran and other regimes accused
of being "Totalitarian".

 thoughts, anyone?  (other than on the last paragraph, please, if that's ok).

l.

