Junio C Hamano <junkio@xxxxxxx> wrote:
> What if (I know, this discussion does not belong here until
> 1.5.0 final) we had a "reverse" database that keeps track of
> what tag references which object, and "git rev-list" knows how
> to exploit it?

It'd be useful.  In as many ways as you suggest.  But it also would
be downright difficult to transfer between repositories, as you have
already pointed out in a public forum to yourself.  :-)

> - If a single-path following turns out to be too expensive
>   (there was a longstanding talk about "git log --single-follow
>   $path"; "git blame" also follows a single path although the
>   target it follows can fork into two or more when following
>   cut&pastes) because we need to explode multi-level trees for
>   each commit while traversing commit ancestry, we could define
>   an annotation to a commit that lists the set of paths the
>   commit touches relative to each of its parents (so the object
>   contains N lists of paths), so that pathspec limiting needs
>   to open and read only one object to figure out that the trees
>   do not have to be opened to skip the commit and/or a merge
>   can be simplified.

_THIS_ is worth doing.

I've been having a lot of discussion on #git with Simon 'corecode'
Schubert and Chris Lee about how poorly git-blame performs compared
to its counterpart in Subversion.

Basically Subversion stores file-level revision ancestry data within
its commit data, allowing it to quickly skip back through the commits
which modified that file.  Git doesn't have this information and must
instead traverse the entire DAG, at least back to the point where the
file was first added to the tree.  With 440,000+ revisions and a file
which exists in nearly all of them, getting anything from "git-blame"
or "git-log -- foo.c" takes ages.

Based on some (limited) profiling with Shark it seems we spend about
50% of our CPU time doing zlib decompression of objects and almost
another 14% parsing the tree objects to apply the path limiter.
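To make Junio's annotation idea concrete, here is a minimal sketch of
how pathspec limiting could consult such an annotation without opening
any trees.  This is purely illustrative: git stores no such annotation
today, and `commit_touches_path` and the list-of-lists layout are
hypothetical names, not anything in git's code.

```python
# Hypothetical per-commit annotation: one list of touched paths per
# parent of the commit (so a merge with N parents carries N lists).
# No such object exists in git; this only illustrates the idea.

def commit_touches_path(annotation, pathspec):
    """For each parent, report whether the commit changed anything
    under `pathspec` relative to that parent -- decided from the
    annotation alone, without exploding any tree objects."""
    def matches(path):
        return path == pathspec or path.startswith(pathspec + "/")
    return [any(matches(p) for p in paths) for paths in annotation]

# A merge with two parents: relative to parent 0 it touched two
# files; relative to parent 1 it touched nothing under our pathspec.
annotation = [["Makefile", "builtin/blame.c"], ["Documentation/git.txt"]]
print(commit_touches_path(annotation, "builtin"))  # [True, False]
```

If every entry comes back False the commit can be skipped outright,
and if exactly one parent shows no change under the pathspec, the
merge can be simplified to that parent, which is exactly the decision
the path limiter currently has to open trees to make.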
The only way to speed up blame/log operations is to reduce the number
of decompressions we need to do to the bare minimum, and maybe also
reduce the tree parsing overhead.  Do that and we can maybe drop the
running time to 1/4th of the current time.

One idea Simon and I were talking about was to store a reverse
file/tree-level DAG in the header of each tree/blob object in the
pack file.  This additional header would be something like a list
of triplets:

  (descendant-commit, ancestor-commit, ancestor-object)

where:

  descendant-commit: the "current" commit being looked at.
  ancestor-commit:   the commit which descendant-commit derives
                     from (directly or indirectly).
  ancestor-object:   the prior version (descendant-commit^:path).

This triplet would probably be encoded with descendant-commit using
OBJ_REF, ancestor-commit being an OBJ_OFS style back reference within
the pack (or OBJ_REF if not in this pack), and ancestor-object also
being an OBJ_REF.  So a triplet would probably wind up costing ~60
bytes.

Triplets would only be stored if ancestor-object != this-object, so
basically only for the commits which changed the path the tree/blob
occupies in descendant-commit.

Finding the prior revision of a tree or file would be a matter of
finding the triplet which matches the current commit, then jumping
through to the ancestor-* values.  If no triplet matches the current
commit then we peel back the parents of the current commit and try
again with those.  Worst case we do what we do now, which is walk
the DAG.  ;-)

This of course penalizes objects which don't ever change, as we'd
have to walk back a good chunk of the DAG before we find a matching
triplet.  But I would suspect that files which never change are also
not given to log/blame very often either.  And once we do find a
triplet, we can skip through the DAG in time proportional to the rate
of change of the path, rather than to the size of the entire
repository.

Thoughts?

-- 
Shawn.
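The lookup described above can be sketched as follows.  The triplet
table and commit parents are plain dicts here; in a real pack they
would be encoded headers using OBJ_REF/OBJ_OFS references as proposed.
`find_prior_version` and the dict layout are assumptions for the sake
of illustration, not a proposed implementation.

```python
# Hypothetical sketch of the triplet lookup proposed above.
# triplets: {descendant-commit: (ancestor-commit, ancestor-object)},
#           stored only for commits that changed the path.
# parents:  {commit: [parent commits]}, i.e. the ordinary commit DAG.

def find_prior_version(commit, triplets, parents):
    """Walk back from `commit` until some commit carries a triplet,
    then jump straight to its ancestor-* values.  Worst case this
    degrades to walking the whole DAG, as git does today."""
    seen = set()
    queue = [commit]
    while queue:
        c = queue.pop(0)
        if c in seen:
            continue
        seen.add(c)
        if c in triplets:
            return triplets[c]   # (ancestor-commit, ancestor-object)
        queue.extend(parents.get(c, []))
    return None                  # no ancestor: path was born here

# Linear history D -> C -> B -> A; the path only changed in B, so
# only B stores a triplet.  Starting at D we peel C, then hit B and
# jump directly to the version at A.
parents  = {"D": ["C"], "C": ["B"], "B": ["A"], "A": []}
triplets = {"B": ("A", "blob-at-A")}
print(find_prior_version("D", triplets, parents))  # ('A', 'blob-at-A')
```

Note how the cost of the walk is bounded by the number of commits
between two changes to the path, which is what makes the "skip in
time proportional to the rate of change" claim above plausible.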
- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html