Re: git pack/unpack over bittorrent - works!

On Sep 4, 2010, at 1:40 AM, Nicolas Pitre wrote:
>> What about the order of the objects in the pack?  Well, ordering 
>> doesn't matter, right?  So let's assume the pack is sorted by hash id.  
>> Is there any downside to that?  I can't think of any, but you're the 
>> pack expert...
> 
> Ordering matters a great deal.  Since object IDs are the SHA1 of their 
> content, those IDs are totally random.  So if you store objects 
> according to their sorted IDs, then the placement of objects belonging 
> to, say, the top commit will be totally random.  And since you are the 
> filesystem expert, I don't have to tell you what performance impacts 
> this random access of small segments of data scattered throughout a 
> 400MB file will have on a checkout operation.

Does git repack optimize the order so that certain things (like checkouts, for example) are really fast?  I admit I hadn't noticed.  It has always seemed to me that things are pretty slow until the core packs are in my page cache.  And of course, the way objects are grouped together and ordered so that "gitk" or "git log" is fast won't be the same as for a checkout operation...
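
For anyone who wants to check this on their own repository, here is a quick way to eyeball the on-disk layout.  This is only a rough sketch; the pack path is whatever happens to be sitting in your objects directory:

    # List the objects in a pack in their on-disk order; one column
    # of each line is the object's byte offset within the pack file.
    git verify-pack -v .git/objects/pack/pack-*.idx | head -20

    # Compare against the objects reachable from HEAD to get a feel
    # for whether recently-touched objects cluster near the front.
    git rev-list --objects HEAD | head -20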

> Sure.  But I don't think it is worth making Git less flexible just for 
> the purpose of ensuring that people could independently create identical 
> packs.  I'd advocate for "no code to write at all" instead, and simply 
> have one person create and seed the reference pack.

I don't think it's a matter of making Git "less flexible"; it's simply the code maintenance headache of needing to support encoding both a canonical format as well as the latest bleeding-edge, most efficient encoding format.  And how often are you changing/improving the encoding process, anyway?  It didn't seem to me like that part of the code was constantly being tweaked/improved.
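
As a concrete illustration of why "identical packs" are harder than they sound, here is a rough test of whether two independent pack-objects runs even come out byte-identical on a single machine.  (Pinning pack.threads to 1 removes the multi-threaded delta search, which is one source of nondeterminism; different git versions or settings are another.)

    # Feed the same object list to pack-objects twice and compare.
    git rev-list --objects HEAD >objects.txt
    git -c pack.threads=1 pack-objects --stdout <objects.txt >a.pack
    git -c pack.threads=1 pack-objects --stdout <objects.txt >b.pack
    cmp a.pack b.pack && echo byte-identical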

Still, you're right, it might not be worth it.  To be honest, I was more interested in the fact that this might also be used to give people hints about how to better repack their local repositories, so that they wouldn't have to run git repack with large --window and --depth arguments.  But that would only provide very small improvements in storage space in most cases, so it's probably not even worth it for that.
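
For reference, by "large --window and --depth arguments" I mean something like the following, which throws away the existing deltas and searches much harder for better ones; the numbers here are only examples:

    # An aggressive repack: -a packs everything, -d drops the old
    # packs, -f recomputes deltas instead of reusing existing ones.
    git repack -a -d -f --window=250 --depth=250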

Quite frankly, I'm a little dubious about how critical peer-to-peer distribution really is, for pretty much any use case.  Most of the time, I can grab the base "reference" tree and drop it on my laptop before I go off the grid and have to rely on EDGE or some other slow networking technology.  And if the use case is some small but illegal-in-some-jurisdiction code, such as ebook DRM liberation scripts (the kind which today are typically distributed via pastebins :-), my guess is that zipping up a git repository and dropping it on a standard bittorrent tracker run by the Swedish Pirate Party is going to be much more effective.   :-)

-- Ted

