Re: git pack/unpack over bittorrent - works!

On Sep 3, 2010, at 3:41 PM, Nicolas Pitre wrote:

> 
> Let's see what such instructions for how to make the canonical pack 
> might look like:

But we don't need to replicate any particular pack.  We just need to provide instructions that can be followed everywhere to produce *a* canonical pack.

> 
> First you need the full ordered list of objects.  That's a 20-byte SHA1
> per object.  The current Linux repo has 1704556 objects, therefore this
> list is 33MB already.

Assume the people creating this "gitdo" pack (i.e., much like jigdo) have a superset of Linus's objects.  So if we have the tips of all of the branches in Linus's repository, we can construct all of the necessary objects going back in time to constitute his repository.  If Linus has only one branch in his repo, we only need a single 20-byte SHA1 branch identifier.  For git, presumably we would need three (one each for next, maint, and master).

What about the order of the objects in the pack?  Well, ordering doesn't matter, right?  So let's assume the pack is sorted by hash id.  Is there any downside to that?  I can't think of any, but you're the pack expert...
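Just to be concrete about what I mean by "sorted by hash id", here is a sketch (made-up helper names, not actual pack-objects code):

	#include <stdlib.h>
	#include <string.h>

	/* Sort the object list by raw SHA1 so that everyone who
	 * follows the instructions emits objects in the same order. */
	static int cmp_sha1(const void *a, const void *b)
	{
		return memcmp(a, b, 20);
	}

	static void canonical_order(unsigned char (*sha1)[20], size_t nr)
	{
		qsort(sha1, nr, sizeof(*sha1), cmp_sha1);
	}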

If we do that, we would only need to send 20 bytes per branch tip instead of 33MB.

> Then you need to identify which of those objects are deltas, and against
> which object.  Assuming we can index in the list of objects, that means,
> say, one bit to identify a delta, and 31 bits for indexing the base. In
> my case this is currently 1393087 deltas, meaning 5.3 MB of additional
> information.

OK, this part we'll need, which means 5.3MB.
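
My reading of that encoding, spelled out (hypothetical names, just to make the arithmetic visible --- 1393087 entries at 4 bytes each is where the ~5.3MB comes from):

	#include <stdint.h>

	/* One 32-bit entry per delta: the high bit flags "this object
	 * is a delta", the low 31 bits index its base in the sorted
	 * object list. */
	static uint32_t encode_delta(uint32_t base_index)
	{
		return (UINT32_C(1) << 31) | (base_index & 0x7fffffff);
	}

	static int is_delta(uint32_t entry)
	{
		return entry >> 31;
	}

	static uint32_t delta_base(uint32_t entry)
	{
		return entry & 0x7fffffff;
	}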


> 
> But then, the deltas themselves can have variations in their encoding.
> And we did change the heuristics for the actual delta encoding in the
> past too (while remaining backward compatible), but for a canonical pack
> creation we'd need to describe that in order to make things totally
> reproducible.
> 
> So there are 2 choices here: Either we specify the Git version to make 
> sure identical delta code is used, but that will put big pressure on 
> that code to remain stable and not improve anymore as any behavior 
> change will create a compatibility issue forcing people to upgrade their 
> Git version all at the same time.  That's not something I want to see 
> the world rely upon.

I don't think the choice is that stark.  It does mean that in addition to whatever pack encoding format is used by git natively, the code would also need to preserve one version of the delta heuristics for "Canonical pack version 1".  After this version is declared, it's true that you might come up with a stunning new innovation that saves some disk space.  How much is that likely to be?  3%?  5%?  Worst case, it means that (1) the BitTorrent-distributed packs might not be as efficient, and (2) the code would be made more complex, because we would either need to (a) keep multiple versions of the code, or (b) add some conditionals:

	if (canonical_pack_version == 1)
		do_this_code();
	else
		do_this_more_clever_code();

Is that really that horrible?  And certainly we should be able to set things up so that it won't be a brake on innovation...

> 
> The other choice is to actually provide the delta output as part of the 
> instruction for the canonical pack creation.
> 
> So that makes for a grand total of 33 MB + 148 MB = 181 MB of data just
> to be able to unambiguously reproduce a pack with a full guarantee of
> perfect reproducibility.

So if we use the methods I've suggested, we would only need to send 5.3MB instead of 33MB or 181MB.

> 
> But even with the presumption of stable delta code, the recipee would 
> still take 38 MB that everyone would have to download every month which 
> is far more than what a monthly incremental update of a kernel repo 
> requires.  Of course you could create a delta between consecutive 
> recipees, but that is becoming rather awkward.

The "recipee" would only need to download this if they are willing to participate as being one of the "seeders" in the BitTorrent network.   People who are willing to do this are presumably willing to transmit many more megabytes of data than 5MB or 33MB or 181MB.  Given the draconian policies of various ISP such as Comcast, it's not clear to me how many people will be willing to be seeders.   But if they are, I don't think downloading 5.3MB of instructions to generate a 600MB canonical pack to be distributed to hundreds or thousands of strangers will stop them.  :-)

> I still think that if someone really want to apply the P2P principle à 
> la BitTorrent to Git, then it should be based on the distributed 
> exchange of _objects_ as I outlined in a previous email, and not file 
> chunks like BitTorrent does.  The canonical Git _objects_ are fully 
> defined, while their actual encoding may change.

The advantage of sending a canonical pack is that it's relatively less code to write, since we can reuse the standard BitTorrent clients and servers to transmit the git repository.  The downside is that it's mainly useful for downloading the entire repository, but I think that's the most useful place for peer-to-peer anyway.

The advantage of a distributed exchange of _objects_ is that you can use it to update a random repository --- but it's normally so efficient to download an incremental set of objects from github or kernel.org that I'm not sure what the point would be.

It would also require exchanging a lot more metadata, since presumably the client would first have to receive all of the object IDs (which would be 33MB), and then use that to decide how to distribute requests for those objects among his/her peers.  That means sending object lists to different servers, and if those peers do not yet have a complete set of objects, they'll have to NACK some of the requests.

Furthermore, with only a partial set of objects downloaded, how will the client do the delta compression?  It will need to store all of the objects in undeltified form, and only when it has received all of the objects will it be able to compress them.  So it's going to be fairly inefficient in terms of disk space.  I suppose the server could send the delta information (another 5.3MB) and then the client could use that to prioritize its object request lists.  But still, this is quite different from the git protocol, and a lot would have to be written from scratch.
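
Very roughly, the client side of such a protocol needs a scheduling loop along these lines (a sketch only; every name here is hypothetical):

	#include <stddef.h>

	struct oid { unsigned char sha1[20]; };

	/* Stub for the network round trip: ask a peer for one object.
	 * Returns 0 on success, -1 if the peer NACKs because it does
	 * not have that object yet. */
	static int request_object(size_t peer, const struct oid *id)
	{
		(void)peer;
		(void)id;
		return 0;
	}

	/* Deal the wanted object IDs out to peers round-robin; a
	 * NACKed request is simply retried against the next peer.
	 * Everything has to be stored undeltified until the full
	 * set has arrived. */
	static void fetch_objects(const struct oid *want, size_t nr, size_t npeers)
	{
		size_t i = 0, turn = 0;

		while (i < nr) {
			if (request_object(turn++ % npeers, &want[i]) < 0)
				continue;	/* this peer lacks it; ask another */
			i++;
		}
		/* only now can delta compression run over the full set */
	}

And even that glosses over peer discovery, duplicate suppression, and the endgame, which is exactly the "written from scratch" problem.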

-- Ted
