Re: VCS comparison table

Linus Torvalds <torvalds@xxxxxxxx> · Thu, 26 Oct 2006 08:05:36 -0700 (PDT)

On Thu, 26 Oct 2006, Vincent Ladeuil wrote:
> 
> Ok, so git make a distinction between the commit (code created by
> someone) and the tree (code only).
> 
> Commits are defined by their parents.

Commits are defined by a _combination_ of:
 - the tree they commit (which is recursive, so the commit name indirectly 
   includes information EVERY SINGLE BIT in the whole tree, in every 
   single file)
 - the parent(s) if any (which is also recursive, so the commit name 
   indirectly includes information about EVERY SINGLE BIT in not just the 
   current tree, but every tree in the history, and every commit that is 
   reachable from it)
 - the author, committer, and dates of each (and committer is actually 
   very often different from author)
 - the actual commit message

So a commit really names - uniquely and authoratively - not just the 
commit itself, but everything ever associated with it.

> Trees are defined by their content only ?

Where "contents" does include names and permissions/types (eg execute bit 
and symlink etc).

> If that's the case, how do you proceed ? 

If you compare the commit name, and they are equal, you automatically know
 - the trees are 100% identical
 - the histories are 100% identical

If you only care about the actual tree, you compare the tree name for 
equality, ie you can do

	git-rev-parse commit1^{tree} commit2^{tree}

and compare the two: if and only if they are equal are the actual contents 
100% equal.

> Calculate a sha1 representing the content (or the content of the
> diff from parent) of all the files and dirs in the tree ?  Or
> from the sha1s of the files and dirs themselves recursively based
> on sha1s of the files and dirs they contain ?

The latter. 

> I ask because the later seems to provide some nice effects
> similar to what makes BDD
> (http://en.wikipedia.org/wiki/Binary_decision_diagram) so
> efficient: you can compare graphs of any complexity or size in
> O(1) by just comparing their signatures.

This is exactly what git does. You can compare entire trees (and 
subdirectories are just other trees) by just comparing 20 bytes of 
information.

How do you think we can do a diff between two arbitrary kernel revisions 
so fast? Why do you think we can afford to do a 

	git log drivers/usb include/linux/usb*

that literally picks out the history (by comparing state) for every commit 
in the tree?

I can do the above log-generation in less than ten _seconds_ for the last 
year and a half of the kernel. That's 20k+ lines of logs of commits that 
only touch those files and directories. And I _need_ it to be fast, 
because that's literally one of the most common operations I do.

And the reason it's fast is that we can compare 20,000 files (names, 
contents, permissions) by just comparing a _single_ 20-byte SHA1.

In git, revision names (and _everything_ has a revision name: commits, 
trees, blobs, tags) really have meaning. They're not just random noise.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html