Re: Calculating tree nodes

"Jon Smirl" <jonsmirl@xxxxxxxxx> · Tue, 4 Sep 2007 01:50:21 -0400

On 9/4/07, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> "Jon Smirl" <jonsmirl@xxxxxxxxx> writes:
>
> >> Yes.  For performance reasons, since a simple commit would kill you in any
> >> reasonably sized repo.
> >
> > That's not an obvious conclusion. A new commit is just a series of
> > edits to the previous commit. Start with the previous commit, edit it,
> > delta it and store it. Storing of the file objects is the same. Why
> > isn't this scheme faster than the current one?
>
> I think you seem to be forgetting about tree comparison.
>
> With a large project that has a reasonable directory structure
> (i.e. not insanely narrow), a commit touches isolated subparts
> of the whole tree.  Think of an architecture specific patch to
> the Linux kernel touching only include/asm-i386 and arch/i386
> directories.
>
> Being able to cull an entire subdirectory (e.g. drivers/ which
> has 5700 files underneath) by only looking at the tree SHA-1 of
> the containing tree is a _HUGE_ win.

In my scheme you have all of the SHAs for the commit in RAM because
they are contained in the commit and you have the commit in RAM. It
takes microseconds to compare these two lists in RAM.

The current scheme is doing disk accesses to get those tree nodes so
of course it is a win to cull the 5700 files.
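
A rough sketch of the in-RAM comparison, in Python. It assumes each
commit carries a flat list of (path, sha) pairs sorted by path; that
layout and the name diff_manifests are made up for illustration and
are not anything git has today. It is one linear pass over both lists
with no object lookups.

```python
def diff_manifests(old, new):
    """Compare two flat, path-sorted lists of (path, sha) pairs.

    Returns a list of (path, old_sha, new_sha) tuples; the sha on the
    side where the path does not exist is None.
    """
    changes = []
    i = j = 0
    while i < len(old) and j < len(new):
        old_path, old_sha = old[i]
        new_path, new_sha = new[j]
        if old_path == new_path:
            if old_sha != new_sha:
                changes.append((old_path, old_sha, new_sha))  # modified
            i += 1
            j += 1
        elif old_path < new_path:
            changes.append((old_path, old_sha, None))  # deleted
            i += 1
        else:
            changes.append((new_path, None, new_sha))  # added
            j += 1
    # Whatever is left on either side exists only on that side.
    changes.extend((p, s, None) for p, s in old[i:])
    changes.extend((p, None, s) for p, s in new[j:])
    return changes


# Hypothetical manifests for two consecutive commits.
old = [("arch/i386/boot.c", "a1"), ("drivers/net/e100.c", "b1"),
       ("kernel/fork.c", "c1")]
new = [("arch/i386/boot.c", "a2"), ("drivers/net/e100.c", "b1"),
       ("kernel/sched.c", "d1")]
changes = diff_manifests(old, new)
```

Here changes holds one modified, one deleted and one added entry. The
work is proportional to the number of paths in the two commits, not to
the number of paths that changed.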

>
> And this is not just about two tree comparison.  When you say:
>
>         git log v2.6.20 -- arch/i386/
>
> what you are seeing is a simplified history that consists of
> commits that touch only these paths.  How would we determine if
> a commit touches these paths efficiently?  By comparing the "i386"
> entry in tree objects for $commit^:arch and $commit:arch.  You
> do not have to look inside arch/i386/ trees to see if any of the
> 330 files in it is different.  You just check a single SHA-1
> pair.

It's more than just comparing a SHA; you have to do disk accesses to
retrieve the SHA.
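
To make the cost visible, here is a toy model of the tree walk in
Python. The store dict, the ids and the touches() helper are
assumptions for the sketch, not git's real code; every read of a tree
from the store stands in for one object lookup that may hit the disk.

```python
def touches(store, old_root, new_root, path, stats):
    """Return True if `path` differs between two root trees.

    `store` maps a tree id to a {name: child id} dict. Each tree read
    is counted in stats["lookups"] as one (possible) disk access.
    """
    def read_tree(tree_id):
        if tree_id is None:
            return {}
        stats["lookups"] += 1
        return store.get(tree_id, {})

    old_id, new_id = old_root, new_root
    for name in path.strip("/").split("/"):
        if old_id == new_id:
            return False  # same id, so nothing underneath can differ
        old_id = read_tree(old_id).get(name)
        new_id = read_tree(new_id).get(name)
    return old_id != new_id


# Two hypothetical commits; only arch/ppc changed between them.
store = {
    "root1": {"arch": "arch1", "drivers": "drv1"},
    "root2": {"arch": "arch2", "drivers": "drv1"},
    "arch1": {"i386": "i386a", "ppc": "ppc1"},
    "arch2": {"i386": "i386a", "ppc": "ppc2"},
}
stats = {"lookups": 0}
changed = touches(store, "root1", "root2", "arch/i386/", stats)
```

With this data changed is False and stats["lookups"] is 4: both root
trees and both arch trees had to be read before the single "i386" pair
could be compared. Nothing below arch/i386 is ever read.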

I'm proposing that we only really need commit and file objects. I also
mentioned that if you think of the file objects as a table you could
use triggers to build cached indexes. To get performance back to the
current level we may want to construct some of these indexes. We need
to explore the scheme more before we can figure out the best cached
indexes to build.

Right now we only have a single index type, the tree nodes. It's a
permanent part of the storage, not a cache. A hierarchical index is not
very useful for indexing non-filename attributes.
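
As a sketch of what a trigger-built cached index could look like, in
Python: FileTable, CachedIndex and the row fields are all invented for
illustration and describe no existing git structure. The point is only
that the index is derived data that can be dropped and rebuilt, and
that the key does not have to be a file name.

```python
from collections import defaultdict


class FileTable:
    """File objects viewed as rows; triggers fire on every insert."""

    def __init__(self):
        self.rows = []
        self.triggers = []

    def add_trigger(self, fn):
        self.triggers.append(fn)

    def insert(self, row):
        self.rows.append(row)
        for fn in self.triggers:
            fn(row)


class CachedIndex:
    """Derived index keyed by an arbitrary attribute of a row."""

    def __init__(self, key):
        self.key = key
        self.entries = defaultdict(list)

    def on_insert(self, row):
        self.entries[self.key(row)].append(row["path"])

    def rebuild(self, rows):
        # A cache can always be thrown away and recomputed.
        self.entries.clear()
        for row in rows:
            self.on_insert(row)


table = FileTable()
by_sha = CachedIndex(key=lambda row: row["sha"])                 # by content
by_dir = CachedIndex(key=lambda row: row["path"].split("/")[0])  # by top dir
table.add_trigger(by_sha.on_insert)
table.add_trigger(by_dir.on_insert)

table.insert({"path": "COPYING", "sha": "s1"})
table.insert({"path": "arch/i386/boot.c", "sha": "s2"})
table.insert({"path": "docs/COPYING", "sha": "s1"})
```

After the three inserts, by_sha.entries["s1"] lists both paths that
hold the same content, which is a question the tree hierarchy cannot
answer without walking everything.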

-- 
Jon Smirl
jonsmirl@xxxxxxxxx