On Wed, Jan 5, 2011 at 5:13 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
> Luke Kenneth Casson Leighton wrote:
>> now that of course leaves you with the problem that you now have
>> potentially hundreds if not thousands or tens of thousands of
>> .torrents to deal with, publish, find etc. etc.
>
> Umm, I'm counting 202400 objects in my git.git and 1799525 in a clone
> of linux-2.6.git.  So I'm not sure how far you want to split things
> into single transfers, but going all the way down to objects will
> massively hurt performance.

yeah... this is a key reason why i came up with a protocol which transferred the exact same pack-objects that HTTP and all the other "point-to-point" git protocols use, to such good effect. the problem was that i was going to rely on multiple clients being able to generate the exact same pack-object, given the exact same input, and then share that pack-object. ok, that's not the problem, that was just the plan :)

nicolas kindly pointed out, at some length, that in a distributed environment that plan was naive, because whenever you request a pack-object (e.g. normally over HTTP or any other git point-to-point protocol) it's generated there-and-then using heuristics and multi-threading that pretty much guarantee that even if you were to make the exact same request of exactly the same system, you'd get *different* pack-objects! not to mention the fact that different people have the same git objects stored in *different* ways, because the object stores, despite containing the same commits, were pulled at different times and end up with completely different sets of git objects representing those exact same commits that everyone else has.

that's all a bit wordy, but you get the idea.

so, nicolas recommended a "simpler" approach, which, well, apologies nicolas but i didn't really like it - it seemed far too simplistic, and i'm not really one for spending time on these kinds of "intermediate baby steps" (wrong choice of words, no offense implied, but i'm sure you know what i mean). i much prefer to hit all the issues head-on, right from the start :)

so, in the intervening time since this was last discussed i've given the pack-objects-distributing idea some thought (and NO, nicolas, just to clarify: this is NOT grabbing the git packed objects that are actually in the .git/objects store, so NO, this does NOT end up bypassing security by giving people objects from another branch - it really IS using that lovely varying data which is heuristic-, store- and threadnum-dependent!).

the plan is to turn that variation in the git pack-objects responses, across multiple peers, into an *advantage* rather than a liability. how? like this (a rough sketch of the selection step follows the list):

* a client requiring objects from commit abcd0123 up to commit efga3456 sends out a DHT broadcast query to all and sundry who have commit abcd0123 and everything in between up to efga3456.

* those clients that can be bothered to respond, do so [refinements below].

* the requestor selects a few of them, and asks them to create git pack-objects. this takes time, but that's ok. once created, the size of the git pack-object is sent as part of the acknowledgement.

* the requestor, on receipt of all the sizes, selects the *smallest* one to begin the p2p (.torrent) transfer from (by asking that remote client to create a .torrent specifically for the purpose, with the filename abcd0123-efga3456).

in this way you end up with not only an efficient git pack-object but, to 99.5% certainty, *THE* most efficient git pack-object.
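(to make that selection step concrete, here's a minimal, purely illustrative sketch in python. dht_query and request_pack_build are hypothetical stand-ins for "ask the DHT who claims to hold this range" and "ask a peer to run pack-objects and report the resulting size" - they are not real git or bittorrent APIs, just assumptions for the example.)

    import random

    def select_best_pack(have, want, dht_query, request_pack_build, sample_size=3):
        # 1. broadcast the commit range and see which peers answer
        responders = dht_query(have, want)
        if not responders:
            raise RuntimeError("no peers claim to hold %s..%s" % (have, want))

        # 2. ask only a small sample to actually run pack-objects (it's expensive)
        candidates = random.sample(responders, min(sample_size, len(responders)))
        sizes = {peer: request_pack_build(peer, have, want) for peer in candidates}

        # 3. the smallest pack wins; its builder seeds a torrent named have-want
        best = min(sizes, key=sizes.get)
        return best, "%s-%s" % (have, want), sizes[best]

the point being that the requestor never has to trust any single peer's idea of "best" - it just compares the reported sizes and picks.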
distributed computing at its best :)

now, an immediately obvious refinement of this is that those .torrent (pack-object) files "stick around" in a cache (with a hard limit defined on the cache size, of course). so, when a client that requires a pack-object makes the request, those remote clients that *already* have a cached pack-object for that specific commit-range should be given first priority, to avoid other clients having to generate massive numbers of git pack-objects.

a further refinement is of course to collect statistics on the number of peers downloading at the time, prioritising those pack-objects which are most actively being distributed. this has fairly obvious benefits :)

yet *another* refinement is slightly less obvious, and it's this: there *COULD* happen to be some existing pack-objects in the cache, not of commits abcd0123-efga3456 but in a ready-made "chain": commits abcd0123-beef7890 packed already and in the cache, and commits beef7890-efga3456 likewise packed already and in the cache. again: the requestor should be informed of these, and make up their own mind as to what to do (see the sketch just below).

it gets rather more complex when you have only *part* of the chain already pre-cached (and have to work out "err, i've got this bit and this bit, but i'd have to generate a git pack-object for the bit in the middle; i'll inform the requestor, and they can decide what to do"), but i do not imagine for one second that this would be anything more than an intriguing coding challenge and, importantly, an optimisation challenge for gittorrent version 3.0 somewhere down the line, rather than an all-out absolute requirement that it must, must be done now, now, now.
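(again purely illustrative: a tiny sketch of how a requestor might plan a transfer out of already-cached ranges, assuming the cache can be treated as a simple mapping of start-commit to end-commit. real code would need actual history walking to work out where a gap in the middle ends; here a gap just falls through to "build the rest".)

    def plan_chain(have, want, cached):
        # cached: dict mapping start commit -> end commit for pack-objects
        # that peers already have sitting in their caches
        steps, cur = [], have
        while cur != want:
            if cur in cached:
                # a ready-made pack continues the chain: reuse its torrent
                steps.append(("cached", cur, cached[cur]))
                cur = cached[cur]
            else:
                # gap: someone has to run pack-objects for the remainder
                steps.append(("build", cur, want))
                cur = want
        return steps

    # with abcd0123-beef7890 and beef7890-efga3456 already cached, the whole
    # range is served from existing torrents and nothing new gets packed:
    print(plan_chain("abcd0123", "efga3456",
                     {"abcd0123": "beef7890", "beef7890": "efga3456"}))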
what else can i mention, that occurred to me... yeah - abandoning a download. if, for some reason, it becomes blindingly obvious that the p2p transfer just isn't working out, then the requestor simply stops the process and starts again. a refinement of this, which is a bit cheeky i know, is to keep *two* simultaneous requests and downloads going for the *exact* same git pack-object commit-chain, but with different data from different groups of peers, for a short period of time, and then abandon one of them once it's clear which is best. this does seem a bit cheeky, but it has the advantage that if the one that _was_ fastest goes tits-up, you can at least go back to the previous one and, assuming that the cache hasn't been cleared, just join in again. but this is _really_ something that's wayyy down the line, for gittorrent version 4.0 or 5.0 or so.

so, can you see that:

a) this is a far cry from the "simplistic transfer of blobs and trees"
b) it's *not* going to overload peoples' systems by splattering (eek!) millions of sha1 sums across the internet as bittorrent files
c) it _does_ fit neatly into the bittorrent protocol
d) it combines the best of git with the best of p2p distributed networking principles...

... all of which creates a system which people will _still_ say is a "hammer looking for nails" :) ... right up until the point where some idiot in the USA government decides to seize sourceforge.net, github.com, gitorious.org and savannah.gnu.org because they contain source code of software that MIGHT be used for copyright infringement.

whilst i realise that the only one of those that might be missed is sourceforge, you cannot ignore the fact that the trust placed in governments and large corporations to run the internet infrastructure is now completely gone, and that the USA and other countries are now putting in place hypocritical policies that put them into the same category that used to be reserved for China, Saudi Arabia, Iran and other regimes accused of being "Totalitarian".

thoughts, anyone? (other than on the last paragraph, please, if that's ok.)

l.