On Fri, 3 Sep 2010, Ted Ts'o wrote: > On Fri, Sep 03, 2010 at 10:12:29AM -0700, Junio C Hamano wrote: > > Theodore Tso <tytso@xxxxxxx> writes: > > > > > ... So people who are willing > > > to participate as part of the peer2peer network can download the > > > instructions for how to make the canonical pack once a month, and use it > > > to create the canonical pack. If the "Gittorrent master" has spent a > > > lot of time to carefully compute the most efficient set of delta > > > pairings, they will get the slight benefit of a more efficient pack > > > which they could use instead of th eir local one without having to use > > > large values of --window and --depth to "git repack". > > > > Hmm, is the idea essentially to tell people "Here is a snapshot of Linus > > repository as of a few weeks ago, carefully repacked. Instead of running > > "git clone" yourself, please bootstrap your repository by copying it over > > bittorrent and then "git pull" to update it"? > > Essentially, yes. I just don't think bittorrent makes sense for > anything else, because the git protocol is so much more efficient for > tiny incremental updates... > > So the only other part of my idea is that we could construct a special > set of instructions that would allow them to recreate the carefully > repacked snapshot of Linus's repository without having to download it > from a central seed site. Instead, they could download a small set of > instructions, and use that in combination with the objects already in > their repository, to create a bit-identical version of the carefully > repacked Linus repository. It's basically rip-off of jigdo, but > applied to git repositories instead of Debian .iso files. Small? Well... Let's see what such instructions for how to make the canonical pack might look like: First you need the full ordered list of objects. That's a 20-byte SHA1 per object. The current Linux repo has 1704556 objects, therefore this list is 33MB already. Then you need to identify which of those objects are deltas, and against which object. Assuming we can index in the list of objects, that means, say, one bit to identify a delta, and 31 bits for indexing the base. In my case this is currently 1393087 deltas, meaning 5.3 MB of additional information. But then, the deltas themselves can have variations in their encoding. And we did change the heuristics for the actual delta encoding in the past too (while remaining backward compatible), but for a canonical pack creation we'd need to describe that in order to make things totally reproducible. So there are 2 choices here: Either we specify the Git version to make sure identical delta code is used, but that will put big pressure on that code to remain stable and not improve anymore as any behavior change will create a compatibility issue forcing people to upgrade their Git version all at the same time. That's not something I want to see the world rely upon. The other choice is to actually provide the delta output as part of the instruction for the canonical pack creation. In my case, the delta output represents: $ git verifi-pack -v .git/objects/pack/*.pack | \ awk --posix '/^[0-9a-f]{40}/ && $6 { tot += 1; size += $4 } \ END { print tot, size }' 1393087 155022247 We therefore have 148 MB of purely delta data here. So that makes for a grand total of 33 MB + 148 MB = 181 MB of data just to be able to unambiguously reproduce a pack with a full guarantee of perfect reproducibility. But even with the presumption of stable delta code, the recipee would still take 38 MB that everyone would have to download every month which is far more than what a monthly incremental update of a kernel repo requires. Of course you could create a delta between consecutive recipees, but that is becoming rather awkward. I still think that if someone really want to apply the P2P principle à la BitTorrent to Git, then it should be based on the distributed exchange of _objects_ as I outlined in a previous email, and not file chunks like BitTorrent does. The canonical Git _objects_ are fully defined, while their actual encoding may change. Nicolas