Re: git pack/unpack over bittorrent - works!

Nicolas Pitre <nico@xxxxxxxxxxx> · Sat, 04 Sep 2010 00:39:56 -0400 (EDT)

On Sat, 4 Sep 2010, Artur Skawina wrote:

> Hmm, taking a few steps back, what is the expected usage of git-p2p?
> Note it's a bit of a trick question; what i'm really asking is what _else_,
> other than pulling/tracking Linus' kernel tree will/can be done with it?

Dunno.

> Because once you accept that all peers are equal, but some peers are more
> equal than others, deriving a canonical representation of the object store
> becomes relatively simple.

That depends what you consider a canonical representation.  I don't 
think the actual object store should ever be "canonicalized".

> Then, it's just a question of fetching the missing
> bits, whether using a dumb (rsync-like) transport, or a git-aware protocol.

But Git does that already.

> (I've no idea why you'd want to base a transfer protocol on the unstable packs,
> building it on top of objects seems to be the only sane choice)

There seems to be quite some confusion around objects and packs.

The Git "database" is _only_ a big pile of objects that is content 
addressable i.e. each object has a name which is derived from its 
content.  This is the 40 hexadecimal string.

There are only 4 types of objects. Roughly they are:

1) A "blob" object contains plain data, usually used for file content.

2) A "tree" object contains a list of entries made of a file or 
   directory name, and the object name that corresponds to it.  For 
   files, the referenced objects are "blobs". For directories, the 
   referenced objects are some other "trees".  This is how the file and 
   directory hierarchy are represented.

3) A "commit" object contains a reference to the top tree object 
   corresponding to the root directory of the project, a reference to 
   the previous "commit" object, and a text message to describe this 
   commit.  If this commit represents a merge, then there will be more 
   than one reference to previous commits.  This is how the commit 
   history is represented.

4) And finally a "tag" object contains a reference to any other object 
   and a text message.  Most of the time, only commit objects are 
   referenced that way.  This is used to identify some particular 
   commits.

And finally, there are a few files, one for each "branch", used to 
contain a reference to the latest commit object for each of those 
branches.

That's it!  Here you have the *whole* architecture of Git!

Now... one way to store those objects on disk is to simply deflate them 
with zlib and put the result in a file, one file per object.  The first 
2 chars from the object name are used to create ssubdirectories under 
.git/objects/ and the remaining 38 chars are used for the actual file 
name within those subdirectories.  This is the "loose" object format or 
encoding.

Another way to store those objects is to cram them together in one or 
multiple (or many) pack files.  The advantage with the pack file is that 
we can encode any object as a delta against any other object in the same 
pack file.  This is the "packed" object format or encoding.

> I'm mostly git-ignorant and i'm assuming the following two things -- if someone
> more familiar w/ git internals could confirm/deny, that would be great:
> 
> 1) "git pull git:..." would (or could be made to) work w/ a client that asks for
>    "A..E", but also tells the server to omit "B,C and D" from the wire traffic.    

What Git does when transferring data on the wire is actually to create a 
special pack file that contains _only_ those objects that the sender has 
but that the receiver doesn't, and stream that over the net.  So if the 
client tells the server that it already has commit A, then the server 
will create a pack that contains only those objects that were created 
after commit A, and omit all the objects that can be reached through 
commit A that are also used by later commits (think unchanged files).  
If you also have commits B, C and D, then the server will also exclude 
all the objects that are reachable through those commits from that 
special pack.

On the receiving end, Git simply writes the received pack into a file 
along with the other existing packs, and compute a pack index for it.

> 2) Git doesn't use chained deltas. IOW given commits "A --d1-> B --d2-> C",
>    "C" can be represented as a delta against "A" or "B", but _not_ against "d1". 
>    (Think of the case where "C" reverts /part of/ "B")

Git does use chained deltas indeed.  But deltas are used only at the 
object level within a pack file.  Any blob object can be represented as 
a delta against any other blob in the pack, regardless of the commit(s) 
those blob objects belong to.  Same thing for tree objects.  So you can 
have deltas going in total random directions if you look them from a 
commit perspective.  So "C" can have some of its objects being deltas 
against objects from "B", or "A", or any other commit for that matter, 
or even objects belonging to the same commit "C". And some other objects 
from "B" can delta against objects from "C" too. There is simply no 
restrictions at all on the actual delta direction.  The only rule is 
that an object may only delta against another object of the same type.

Of course we don't try to delta each object against all the other 
available objects as that would be a O(n^2) operation (imagine with n = 
1.7 million objects).  So we use many heuristics to make this delta 
packing efficient without taking an infinite amount of time.

For example, if we have objects X and Y that need to be packed together 
and sent to a client over the net, and we find that Y is already a delta 
against X in one pack that exists locally, then we simply and literally 
copy the delta representation of Y from that local pack file and send it 
out without recomputing that delta.

> Then there are security implications... Which pretty much mandate having "special"
> peers anyway, at least for transferring heads (branches/tags etc). Which means
> the second paragraph above applies.

Well... Actually, all you need is only one trusted peer to provide those 
heads i.e. the top commit SHA1 name for each branches you need.  From 
that one SHA1 name per branch, you can validate the entire repository as 
every object reference throughout is based on the content of the object 
it refers to.  For example, to validate the authenticity of everything 
from a random copy of the Linux kernel repository, I need only 20 bytes 
from a trusted source.  No need to have this information distributed 
amongst multiple peers.

And even if the delta encoding is different from the one used in Linus' 
repository, or even if the packing is done differently (different number 
of packs, etc.) then the final SHA1 will always be the same.  This is 
because the actual content from all referenced objects is the same 
regardless of their effective encoding or format.

Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html