Re: Measuring Community Involvement (was Re: Contributor Summit planning)

Jeff King <peff@xxxxxxxx> · Tue, 14 Aug 2018 15:36:46 -0400

On Tue, Aug 14, 2018 at 01:43:38PM -0400, Derrick Stolee wrote:

> On 8/13/2018 5:54 PM, Jeff King wrote:
> > So I try not to think too hard on metrics, and just use them to get a
> > rough view on who is active.
> 
> I've been very interested in measuring community involvement, with the
> knowledge that any metric is flawed and we should not ever say "this metric
> is how we measure the quality of a contributor". It can be helpful, though,
> to track some metrics and their change over time.
> 
> Here are a few measurements we can make:

Thanks, it was nice to see a more comprehensive list in one spot.

It would be neat to have a tool that presents all of these
automatically, but I think the email ones are pretty tricky (most people
don't have the whole list archive sitting around).

> 2. Number of other commit tag-lines (Reviewed-By, Helped-By, Reported-By,
> etc.).
> 
>     Using git repo:
> 
>     $ git log --since=2018-01-01 junio/next|grep by:|grep -v
> Signed-off-by:|sort|uniq -c|sort -nr|head -n 20

At one point I sent a patch series that would let shortlog group by
trailers. Nobody seemed all that interested and I didn't end up using it
for its original purpose, so I didn't polish it further.  But I'd be
happy to re-submit it if you think it would be useful.

The shell hackery here isn't too bad, but doing it internally is a
little faster, a little more robust (less parsing), and lets you show
more details about the commits themselves (e.g., who reviews whom).

> 3. Number of threads started by user.

You have "started" and "participated in". I guess one more would be
"closed", as in "solved a bug", but that is quite hard to tell without
looking at the content. Taking just the last person in a thread as the
closer means that an OP saying "thanks!" wrecks it. And somebody who
rants long enough that everybody else loses interest gets marked as a
closer. ;)

> If you have other ideas for fun measurements, then please let me know.

I think I mentioned "surviving lines" elsewhere, which I do like this
(and almost certainly stole from Junio a long time ago):

  # Obviously you can tweak this as you like, but the mass-imported bits
  # in compat and xdiff tend to skew the counts. It's possibly worth
  # counting language lines separately.
  git ls-files '*.c' '*.h' :^compat :^contrib :^xdiff |
  while read fn; do
    # eye candy
    echo >&2 "Blaming $fn..."

    # You can use more/fewer -C to dig more or less for code moves.
    # Possibly "-w" would help, though I doubt it shifts things more
    # than a few percent anyway.
    git blame -C --line-porcelain $fn
  done |
  perl -lne '/^author (.*)/ and print $1' |
  sort | uniq -c | sort -rn | head

The output right now is:

  35156 Junio C Hamano
  22207 Jeff King
  17466 Nguyễn Thái Ngọc Duy
  12005 Johannes Schindelin
  10259 Michael Haggerty
   9389 Linus Torvalds
   8318 Brandon Williams
   7776 Stefan Beller
   5947 Christian Couder
   4935 René Scharfe

which seems reasonable.

-Peff