On Thu, 26 Oct 2006, Vincent Ladeuil wrote:
>
> Ok, so git make a distinction between the commit (code created by
> someone) and the tree (code only).
>
> Commits are defined by their parents.

Commits are defined by a _combination_ of:

 - the tree they commit (which is recursive, so the commit name
   indirectly includes information about EVERY SINGLE BIT in the whole
   tree, in every single file)

 - the parent(s) if any (which is also recursive, so the commit name
   indirectly includes information about EVERY SINGLE BIT in not just
   the current tree, but every tree in the history, and every commit
   that is reachable from it)

 - the author, committer, and dates of each (and committer is actually
   very often different from author)

 - the actual commit message

So a commit really names - uniquely and authoritatively - not just the
commit itself, but everything ever associated with it.

> Trees are defined by their content only ?

Where "contents" does include names and permissions/types (eg execute
bit and symlink etc).

> If that's the case, how do you proceed ?

If you compare the commit name, and they are equal, you automatically
know

 - the trees are 100% identical

 - the histories are 100% identical

If you only care about the actual tree, you compare the tree name for
equality, ie you can do

	git-rev-parse commit1^{tree} commit2^{tree}

and compare the two: if and only if they are equal are the actual
contents 100% equal.

> Calculate a sha1 representing the content (or the content of the
> diff from parent) of all the files and dirs in the tree ? Or
> from the sha1s of the files and dirs themselves recursively based
> on sha1s of the files and dirs they contain ?

The latter.

> I ask because the later seems to provide some nice effects
> similar to what makes BDD
> (http://en.wikipedia.org/wiki/Binary_decision_diagram) so
> efficient: you can compare graphs of any complexity or size in
> O(1) by just comparing their signatures.

This is exactly what git does.
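
As a rough illustration of the recursive naming described above, here is
a minimal Python sketch. It assumes the usual git object layout (a
"<type> <size>" header, a NUL byte, then the body, all hashed with
SHA-1), and that a tree body is a sorted list of "<mode> <name>" entries,
each followed by a NUL byte and the 20-byte name of the blob or subtree.
The helper names (object_name, blob_name, tree_name) and the nested-dict
representation of a directory are hypothetical, and every file is
treated as a plain non-executable file, so this is a sketch of the idea
rather than a replacement for git itself.

```python
import hashlib


def object_name(kind: str, body: bytes) -> bytes:
    """Return the 20-byte SHA-1 of '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).digest()


def blob_name(content: bytes) -> bytes:
    """Name a file purely by its content."""
    return object_name("blob", content)


def tree_name(entries: dict) -> bytes:
    """Name a directory from the names of everything it contains.

    `entries` maps a file name to bytes (file content) or to another
    dict (a subdirectory). Simplification: all files get mode 100644.
    """
    items = []
    for name, value in entries.items():
        if isinstance(value, dict):
            # Subdirectories sort as if their name ended in '/'.
            items.append(((name + "/").encode(), "40000", name, tree_name(value)))
        else:
            items.append((name.encode(), "100644", name, blob_name(value)))

    body = b""
    for _, mode, name, sha in sorted(items, key=lambda item: item[0]):
        body += f"{mode} {name}".encode() + b"\0" + sha
    return object_name("tree", body)


# Two trees with the same content, built in a different order.
tree_a = {"README": b"hi\n", "drivers": {"usb": {"hub.c": b"int x;\n"}}}
tree_b = {"drivers": {"usb": {"hub.c": b"int x;\n"}}, "README": b"hi\n"}
# One byte changed, three directory levels down.
tree_c = {"README": b"hi\n", "drivers": {"usb": {"hub.c": b"int y;\n"}}}

same = tree_name(tree_a) == tree_name(tree_b)       # one 20-byte compare
different = tree_name(tree_a) != tree_name(tree_c)  # change propagates up
print(same, different)
```

Because each tree name is computed from the names of its children, a
single changed byte changes the name of every directory above it, and
two equal names mean the whole subtree underneath is equal. Note that
tree_a and tree_c still share the same README blob name, so a comparison
only has to descend into "drivers".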
You can compare entire trees (and subdirectories are just other trees)
by just comparing 20 bytes of information.

How do you think we can do a diff between two arbitrary kernel revisions
so fast? Why do you think we can afford to do a

	git log drivers/usb include/linux/usb*

that literally picks out the history (by comparing state) for every
commit in the tree?

I can do the above log-generation in less than ten _seconds_ for the
last year and a half of the kernel. That's 20k+ lines of logs of commits
that only touch those files and directories. And I _need_ it to be fast,
because that's literally one of the most common operations I do.

And the reason it's fast is that we can compare 20,000 files (names,
contents, permissions) by just comparing a _single_ 20-byte SHA1.

In git, revision names (and _everything_ has a revision name: commits,
trees, blobs, tags) really have meaning. They're not just random noise.

		Linus