Re: Calculating tree nodes

Andreas Ericsson <ae@xxxxxx> · Tue, 04 Sep 2007 07:51:54 +0200

Jon Smirl wrote:
On 9/4/07, Martin Langhoff <martin.langhoff@xxxxxxxxx> wrote:
On 9/4/07, Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
Yes.  For performance reasons, since a simple commit would kill you in any
reasonably sized repo.
That's not an obvious conclusion. A new commit is just a series of
Hi Jon!

If you search the archives you'll find Linus explaining that the
initial git had all the directory structure in one single "tree"
object that held all the paths, no matter how deep. The problem with
that was that every commit generated a huge new tree object, so he
switched to the current "nested trees" structure, which also has the
nice feature of speeding up diffs/merges if whole subtrees haven't
changed.

In my scheme the commit is only a list of SHAs. The paths are stored
as attributes of the file objects. Commits are just edits to the list
of SHAs in the commit objects. If these lists are kept sorted, then
the delta should be tiny. Just the info on the adds/deletes to the
list.

It will stop being fast when you need to apply (revisions*avg_num_files_changed)
patches before you can start diffing things properly.

This is very different than a single tree blob that contains all of the paths.

Diffing two trees in the scheme is quite fast. Just get their commit
objects into RAM and compare the lists of SHAs.

That's not a very useful diff though. I'd run, screaming, from an SCM that didn't
tell me *how* things have changed in addition to *what*.
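
To make the trade-off concrete, here is a minimal Python sketch of the
comparison described above: two sorted lists of SHAs walked in lockstep.
The function name and the sample IDs are made up for illustration; none
of this is git code. Note that the result only says which IDs appeared
or disappeared. It does not say which path changed or how.

```python
def diff_sorted_sha_lists(old, new):
    """Two-pointer walk over two sorted lists of hex object IDs.

    Returns (added, removed). Hypothetical helper, not a git API.
    """
    added, removed = [], []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            # Same object in both commits: nothing to report.
            i += 1
            j += 1
        elif old[i] < new[j]:
            # Present only in the old list.
            removed.append(old[i])
            i += 1
        else:
            # Present only in the new list.
            added.append(new[j])
            j += 1
    # Whatever is left over on either side is unmatched.
    removed.extend(old[i:])
    added.extend(new[j:])
    return added, removed


# Illustrative IDs only; real ones would be 40-character SHA-1 hex strings.
old_commit = sorted(["a1", "b2", "c3"])
new_commit = sorted(["a1", "c3", "d4"])
added, removed = diff_sorted_sha_lists(old_commit, new_commit)
```

An edited file shows up here as one removed ID plus one added ID, so
pairing them back up into "file X changed" needs the path attributes
from the file objects.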

edits to the previous commit. Start with the previous commit, edit it,
delta it and store it. Storing of the file objects is the same. Why
isn't this scheme faster than the current one?
I think you're a bit confused between 2 different things:

 - git is _snapshot_ based, so every commit-tree-blob set is
completely independent. The "canonical" storage is each of those
gzipped in .git/objects
 - however, for performance and on-disk-footprint, we delta them (very
efficiently I hear)
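
As a rough illustration of the "canonical" storage in the first point,
the Python sketch below builds a loose blob the way git does: a
"blob <size>\0" header plus the content, named by the SHA-1 of that
whole buffer and deflated with zlib (which is what "gzipped" refers to
loosely here). The helper names are invented for this example, and
packfile deltas are not modelled at all.

```python
import hashlib
import zlib


def blob_object(data: bytes):
    """Return (object id, compressed bytes) for a blob. Hypothetical helper."""
    # The header is the type, a space, the decimal size, and a NUL byte.
    store = b"blob " + str(len(data)).encode() + b"\0" + data
    # The object id is the SHA-1 of header + content.
    sha = hashlib.sha1(store).hexdigest()
    return sha, zlib.compress(store)


def loose_object_path(sha: str) -> str:
    """Loose objects are fanned out by the first two hex digits of the id."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"


empty_sha, empty_compressed = blob_object(b"")
```

Because the name depends only on the content, each snapshot's objects
are independent of any other commit, which is the point being made
above.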

The systems are essentially the same with a little reorganization. In
the current system the paths and SHA for a commit are spread over the
tree nodes.

In my scheme the path info is moved into the file object nodes and the
SHA list is in the commit node.

git still works exactly as it has before. I just moved things around
in the storage system. The only thing that should be impacted is
performance.

Perhaps, but negatively so. Git is fast when applying patches, primarily
because it can exclude entire subtrees. It knows it can exclude those
subtrees because their SHA1 hashes are identical. It wouldn't know that
if there weren't the tree objects (well, it could, but walking all the
commits, counting changes and considering '0' to be "no changes" doesn't
scale, so that's a moot point).
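
The subtree exclusion can be sketched in a few lines of Python. This is
a toy model under stated assumptions, not git's object format: the Node
class, its simplified hashing, and diff_trees are all hypothetical
names. The only property it shares with git is that a tree's ID is
derived from its children's names and IDs, so equal IDs mean nothing
underneath can differ.

```python
import hashlib


class Node:
    """Toy object: a blob (bytes) or a tree (dict of name -> Node)."""

    def __init__(self, content):
        self.content = content
        if isinstance(content, bytes):
            payload = b"blob\0" + content
        else:
            # A tree's id covers the names and ids of its children,
            # so any change below it changes its id as well.
            payload = b"tree\0" + b"".join(
                name.encode() + b"\0" + child.sha.encode()
                for name, child in sorted(content.items())
            )
        self.sha = hashlib.sha1(payload).hexdigest()


def is_tree(node):
    return node is not None and isinstance(node.content, dict)


def diff_trees(old, new, prefix=""):
    """Return changed paths, never descending into subtrees with equal ids."""
    old_kids = old.content if is_tree(old) else {}
    new_kids = new.content if is_tree(new) else {}
    changed = []
    for name in sorted(set(old_kids) | set(new_kids)):
        o, n = old_kids.get(name), new_kids.get(name)
        path = prefix + name
        if o is not None and n is not None and o.sha == n.sha:
            continue  # identical id: skip the whole subtree unread
        if is_tree(o) or is_tree(n):
            changed += diff_trees(o if is_tree(o) else None,
                                  n if is_tree(n) else None,
                                  path + "/")
            # A blob replaced by a tree (or vice versa) also changed itself.
            if (o is not None and not is_tree(o)) or (n is not None and not is_tree(n)):
                changed.append(path)
        else:
            changed.append(path)
    return changed


old_root = Node({"src": Node({"a.c": Node(b"1"), "b.c": Node(b"2")}),
                 "doc": Node({"readme": Node(b"r")})})
new_root = Node({"src": Node({"a.c": Node(b"1"), "b.c": Node(b"3")}),
                 "doc": Node({"readme": Node(b"r")})})
changed_paths = diff_trees(old_root, new_root)
```

In this example the "doc" subtree is skipped after a single ID
comparison, however many files it holds. With a flat list of SHAs there
is no such shortcut.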

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231