Re: Trac+Git: rev-list with pathspec performance?

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 7 Oct 2010 22:33:59 +0200

On Thu, 7 Oct 2010, Stephen Bash wrote:

>>> Note that there is proof of concept "tree blame" (in Perl) which
>>> generates such 'last change to file' information, I think faster
>>> than running 'git rev-list -1 <file>' for each file. Even better
>>> would be to encode used algorithm in C.
>>>
>>> http://thread.gmane.org/gmane.comp.version-control.git/150063/focus=150183
>> 
>> My early experiments with your script are good for speed, but for some
>> reason I'm always getting the first commit for a file rather than the
>> most recent. I'll do some experimenting to see if I can uncover the
>> issue.
> 
> Following up, I had to add -r to the diff-tree command line when
> requesting a subdirectory to work around the problem (script always
> returned the first commit).  

Hmmm... I thought that I had added '-r' whenever a path is provided,
i.e. whenever we are not running tree blame on the top-level tree.

> I'm curious if it's faster to get the SHA of the sub-tree and compare
> that before actually running diff-tree?  And for that matter, just run
> diff-tree on the sub-tree that we care about rather than a recursive
> sub-tree on the root?  These may be early optimizations, but they're
> ideas that occurred to me while debugging the code...    

There are many possible optimizations (see also below); for the time
being I was concerned with getting the fast tree blame algorithm right
(and as you can see didn't get it, not completely).
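
A minimal sketch of the sub-tree check suggested above, in Python rather
than the script's Perl.  It assumes that 'git rev-parse <commit>:<path>'
is used to get the object name of the sub-tree; the function names and
the 'resolve' parameter are hypothetical and not part of the tree blame
script.

```python
import subprocess


def subtree_id(commit, path, repo="."):
    """Return the object name of `path` in `commit`, or None if it is absent.

    Assumption: shells out to 'git rev-parse --verify --quiet <commit>:<path>'.
    """
    result = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", f"{commit}:{path}"],
        cwd=repo, capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def subtree_changed(commit, parent, path, resolve=subtree_id):
    """True if `path` names a different object in `commit` than in `parent`.

    When this returns False the diff-tree call for this commit could be
    skipped, because nothing below `path` was touched.
    """
    return resolve(commit, path) != resolve(parent, path)
```

Whether the extra rev-parse calls pay for themselves compared with
simply running diff-tree would have to be measured.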

>>> P.S. Alternate solution would be to simply get rid of SVN-inspired
>>> view.  Git tracks history of a *project* as a whole, not set of
>>> histories for individual files (like CVS).
> 
> After a lot of experimentation, this is basically what we did.
> I modified the Trac templates to not list the last change SHA or log
> message in the directory view.  After all my testing, I just don't
> think there's a fast way to get this information from Git.  This
> blame-dir script is the fastest alternative I've tried (about 5x
> faster than rev-list'ing each file), but it's still ~30 seconds on my
> machine (which is faster than our web server), and IMHO that's too
> long to ask a user to wait for a page to load.       

First, there is a lot of room for optimization of the tree blame
script, some of which I have noted in comments, and some of which you
have found.  While developing this script I noticed that the current
plumbing doesn't completely fit the tree blame algorithm; for example
we need '-r' to blame a subtree (subdirectory), but we only need paths
up to the depth of the blamed directory, no deeper.
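
To illustrate the depth mismatch: a recursive diff reports full paths,
while tree blame only cares about the direct entries of the blamed
directory.  A minimal sketch (Python, hypothetical helper name, not
taken from the script) of folding a recursive path back to that depth:

```python
def entry_under(directory, path):
    """Map a path from a recursive diff to the direct child of `directory`.

    Returns None if `path` does not lie below `directory`.
    An empty `directory` means the top-level tree.
    """
    prefix = directory.strip("/")
    prefix = prefix + "/" if prefix else ""
    if not path.startswith(prefix):
        return None
    rest = path[len(prefix):]
    # Keep only the first component below the blamed directory.
    return rest.split("/", 1)[0] or None
```

For example, when blaming 'src', both 'src/lib/a.c' and 'src/lib/b.c'
fold to the single entry 'lib'.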

Rewriting tree-blame in C, using in-core revision and tree traversal,
should be faster, though I'm not sure by how much.
Unfortunately I don't know the git API well enough; I thought that
writing a Perl script would be easier.

But you are right that such a view will always be expensive in Git,
because Git tracks the history of a project *as a whole*.  If a file
was created in the root commit (first commit) and left unchanged, it
would be easy to find in a VCS that stores history on a per-file basis,
at least to some extent; in git you have to walk through commits all
the way to the root commit in this case.  If the history is long, it
might take some time.
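
A simplified sketch of why the cost depends on the oldest untouched
entry.  It assumes a linear history given newest first as pairs of
(commit, changed entries), which sidesteps merges entirely; the names
are hypothetical and this is not the algorithm from the script.

```python
def last_changes(entries, history):
    """Find the newest commit touching each entry.

    `history` is an iterable of (commit, changed_entries), newest first.
    Returns (blamed, visited): a dict entry -> commit, and the number of
    commits that had to be examined.
    """
    pending = set(entries)
    blamed = {}
    visited = 0
    for commit, changed in history:
        if not pending:
            break  # every entry is blamed, no need to walk further
        visited += 1
        changed = set(changed)
        for name in pending & changed:
            blamed[name] = commit
        pending -= changed
    return blamed, visited
```

The walk stops early only if every entry was changed recently; a single
entry untouched since the first commit forces a walk over the whole
history.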

Second, you can use the trick that the GitHub web interface uses to
display a similar view, namely first displaying just the tree of files,
and then incrementally filling in the 'last changed' info.  Gitweb does
something similar in its 'blame_incremental' view; that is why the idea
was for tree blame ("git blame <directory>") to support an incremental
format, similar to ordinary blame.
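
The incremental idea can be sketched as a generator that reports the
bare listing first and then each 'last changed' result as soon as it is
known.  This is only an illustration in Python: the event tuples are a
made-up format (not the output format of gitweb or of git blame), and
the history is again assumed to be linear and given newest first.

```python
def incremental_blame(entries, history):
    """Yield ('tree', names) first, then ('blame', name, commit) events.

    `history` is an iterable of (commit, changed_entries), newest first.
    A consumer can render the listing at once and fill in rows later.
    """
    pending = set(entries)
    yield ("tree", sorted(pending))
    for commit, changed in history:
        if not pending:
            return
        for name in sorted(pending & set(changed)):
            pending.discard(name)
            yield ("blame", name, commit)
```

A web frontend would send the 'tree' event as the initial page and
apply each 'blame' event as an update, so the user is not kept waiting
for the slowest entry.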

This might take some effort to develop, though...
-- 
Jakub Narebski
Poland