Junio C Hamano <junkio@xxxxxxx> wrote:
> What if (I know, this discussion does not belong here until
> 1.5.0 final) we had a "reverse" database that keeps track of
> what tag references which object, and "git rev-list" knows how
> to exploit it?

It'd be useful.  In as many ways as you suggest.  But it also would
be downright difficult to transfer between repositories, as you have
already pointed out in a public forum to yourself.  :-)

> - If a single-path following turns out to be too expensive
>   (there was a longstanding talk about "git log --single-follow
>   $path"; "git blame" also follows a single path although the
>   target it follows can fork into two or more when following
>   cut&pastes) because we need to explode multi-level trees for
>   each commit while traversing commit ancestry, we could define
>   an annotation to a commit that lists the set of paths the
>   commit touches relative to each of its parents (so the object
>   contains N lists of paths), so that pathspec limiting needs
>   to open and read only one object to figure out that the trees
>   do not have to be opened to skip the commit and/or a merge
>   can be simplified.

_THIS_ is worth doing.

I've been having a lot of discussion on #git with Simon 'corecode'
Schubert and Chris Lee about how poorly git-blame performs compared
to its counterpart in Subversion.

Basically Subversion stores file-level revision ancestry data within
its commit data, allowing it to quickly skip back through the commits
which modified that file.  Git doesn't have this information and must
instead traverse the entire DAG, at least back to the point where the
file was first added to the tree.  With 440,000+ revisions and a file
which exists in nearly all of them, getting anything from "git-blame"
or "git-log -- foo.c" takes ages.

Based on some (limited) profiling with Shark it seems we spend about
50% of our CPU time doing zlib decompression of objects and almost
another 14% parsing the tree objects to apply the path limiter.
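To make Junio's annotation idea concrete, here is a minimal sketch of
how pathspec limiting could consult such an annotation without opening
any trees.  This is purely illustrative: git stores no such annotation
today, and `commit_touches_path` and the list-of-lists layout are
hypothetical names, not anything in git's code.

```python
# Hypothetical per-commit annotation: one list of touched paths per
# parent of the commit (so a merge with N parents carries N lists).
# No such object exists in git; this only illustrates the idea.

def commit_touches_path(annotation, pathspec):
    """For each parent, report whether the commit changed anything
    under `pathspec` relative to that parent -- decided from the
    annotation alone, without exploding any tree objects."""
    def matches(path):
        return path == pathspec or path.startswith(pathspec + "/")
    return [any(matches(p) for p in paths) for paths in annotation]

# A merge with two parents: relative to parent 0 it touched two
# files; relative to parent 1 it touched nothing under our pathspec.
annotation = [["Makefile", "builtin/blame.c"], ["Documentation/git.txt"]]
print(commit_touches_path(annotation, "builtin"))  # [True, False]
```

If every entry comes back False the commit can be skipped outright,
and if exactly one parent shows no change under the pathspec, the
merge can be simplified to that parent, which is exactly the decision
the path limiter currently has to open trees to make.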
The only way to speed up blame/log operations is to reduce the number
of decompressions we need to do to the bare minimum, and maybe also
reduce the tree parsing overhead.  Do that and we can maybe drop the
running time to 1/4th of the current time.

One idea Simon and I were talking about was to store a reverse
file/tree-level DAG in the header of each tree/blob object in the
pack file.  This additional header would be something like a list
of triplets:

  (descendant-commit, ancestor-commit, ancestor-object)

where:

  descendant-commit: the "current" commit being looked at.
  ancestor-commit:   the commit which descendant-commit derives
                     from (directly or indirectly).
  ancestor-object:   the prior version (descendant-commit^:path).

This triplet would probably be encoded with descendant-commit using
OBJ_REF, ancestor-commit being an OBJ_OFS style back reference within
the pack (or OBJ_REF if not in this pack), and ancestor-object also
being an OBJ_REF.  So a triplet would probably wind up costing ~60
bytes.

Triplets would only be stored if ancestor-object != this-object, so
basically only for the commits which changed the path the tree/blob
occupies in descendant-commit.

Finding the prior revision of a tree or file would be a matter of
finding the triplet which matches the current commit, then jumping
through to the ancestor-* values.  If no triplet matches the current
commit then we peel back the parents of the current commit and try
again with those.  Worst case we do what we do now, which is walk
the DAG.  ;-)

This of course penalizes objects which don't ever change, as we'd
have to walk back a good chunk of the DAG before we find a matching
triplet.  But I would suspect that files which never change are also
not given to log/blame very often either.  And once we do find a
triplet, we can skip through the DAG in time proportional to the rate
of change of the path, rather than to the size of the entire
repository.

Thoughts?

-- 
Shawn.
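The lookup described above can be sketched as follows.  The triplet
table and commit parents are plain dicts here; in a real pack they
would be encoded headers using OBJ_REF/OBJ_OFS references as proposed.
`find_prior_version` and the dict layout are assumptions for the sake
of illustration, not a proposed implementation.

```python
# Hypothetical sketch of the triplet lookup proposed above.
# triplets: {descendant-commit: (ancestor-commit, ancestor-object)},
#           stored only for commits that changed the path.
# parents:  {commit: [parent commits]}, i.e. the ordinary commit DAG.

def find_prior_version(commit, triplets, parents):
    """Walk back from `commit` until some commit carries a triplet,
    then jump straight to its ancestor-* values.  Worst case this
    degrades to walking the whole DAG, as git does today."""
    seen = set()
    queue = [commit]
    while queue:
        c = queue.pop(0)
        if c in seen:
            continue
        seen.add(c)
        if c in triplets:
            return triplets[c]   # (ancestor-commit, ancestor-object)
        queue.extend(parents.get(c, []))
    return None                  # no ancestor: path was born here

# Linear history D -> C -> B -> A; the path only changed in B, so
# only B stores a triplet.  Starting at D we peel C, then hit B and
# jump directly to the version at A.
parents  = {"D": ["C"], "C": ["B"], "B": ["A"], "A": []}
triplets = {"B": ("A", "blob-at-A")}
print(find_prior_version("D", triplets, parents))  # ('A', 'blob-at-A')
```

Note how the cost of the walk is bounded by the number of commits
between two changes to the path, which is what makes the "skip in
time proportional to the rate of change" claim above plausible.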
- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html