Re: How to efficiently blame an entire repo?

On Thu, Apr 29, 2010 at 07:12:27PM -0400, Jay Soffian wrote:

> Let's say you've got a repo with ~ 40K files and 35K commits.
> Well-packed .git is about 800MB.
> 
> You want to find out how many lines of code a particular group of
> individuals has contributed to HEAD.
> 
> The naive solution is to run git blame on all 40K files, grep'ing for
> just the authors you want.

With the exception of your "blame only those files that you know your
authors have touched" optimization, I think you pretty much have to do
this. Anything else would just be reimplementing blame. You can't throw
away content prematurely, because any given line may eventually be
blamed to one of your authors of interest.
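
A rough sketch of that loop might look like this (untested; the
--line-porcelain format prints an "author " header for every blamed
line, so counting those gives per-author line totals):

  $ git ls-files -z |
    xargs -0 -n1 git blame --line-porcelain HEAD -- |
    grep '^author ' | sort | uniq -c | sort -rn

From there you can just pick out the authors you care about.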

I think this is also what Junio ended up doing when presenting at
GitTogether '08:

  http://userweb.kernel.org/~junio/200810-Chron.pdf

In theory a blame that handled multiple files in one pass could be
faster than blaming them one at a time. I would be curious to see the
performance difference between:

  $ git blame file1 file2 ;# not actually implemented

and

  $ for i in file1 file2; do git blame $i; done

Much of the work is O(content), but there is some overlap between files
in walking the history and generating the diffs.
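
The loop side is at least easy to time today (file1 and file2 being
whatever paths you are testing against), which would give a baseline to
compare if somebody implements the multi-file version:

  $ time for i in file1 file2; do git blame $i >/dev/null; done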

-Peff
