[SoC RFC] git statistics - information about commits

"alturin marlinon" <alturin@xxxxxxxxx> · Fri, 21 Mar 2008 09:52:38 +0100

Heya,

With the Google Summer of Code application deadline closing in
fast, I would really appreciate some feedback on my
application so far, especially on which parts of this idea would be
appreciated the most, and which parts could be done without.

I have been using git on several projects so far and am very happy
with the way it does things.
When looking at TortoiseSVN I noticed that it comes with a
'statistics' button that allows you to see which users have done what.
Even though it is limited in that it can only show how many commits
were made, I think this is an important feature for any VCS. I became
aware of the importance of statistics during a project at my
university (we had to use Subversion). During the project I noticed I
used these statistics to talk about fair distribution of work, and it
really helped to get everybody's nose pointing in the right direction.
Keeping that in mind, I tried to get such statistics from git. Git
provides a 'commits per user' feature under 'git shortlog -s -n -c
master' (note the order of the switches).
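As a rough sketch of how a script could consume that output, assuming
the usual 'count<TAB>author' layout printed by 'git shortlog -s -n'
(the function names are made up for illustration, and the -c switch
is left out here):

```python
import subprocess


def parse_shortlog(text):
    """Turn 'git shortlog -s -n' output into (author, commits) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line looks like "    12<TAB>Author Name".
        count, author = line.strip().split(None, 1)
        pairs.append((author, int(count)))
    return pairs


def commits_per_author(rev="HEAD"):
    """Run shortlog on a revision and parse the result."""
    # The revision is passed explicitly so that shortlog does not
    # try to read a log from stdin when run from a script.
    out = subprocess.run(
        ["git", "shortlog", "-s", "-n", rev],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_shortlog(out)
```

Calling commits_per_author("master") inside a repository would then
give a list that is already sorted by number of commits.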

Consider Ohloh, an external tool that provides commit information
about contributors to a project.
It provides a quick overview of all contributors to a project, and
what their contribution has been so far. At the moment git does not
have anything similar, even though all the data needed for such an
analysis is present. Integration with gitk and git-web would allow the
data to be presented in a clear and informative way.

Another bit of interesting information would be 'who is maintaining
this code?'. Such information is especially useful when trying to
decide whom to send a copy of a patch. Consider that git already
contains the e-mail address of each developer that maintains a certain
bit of code (this information is included in each commit). If we now
find out who maintains the code that was changed in a commit,
git-format-patch could automatically include them in the cc: field.
Similarly, one might be interested in what code a maintainer is
currently working on.
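One possible way to approximate 'who maintains this code' is to rank
the author e-mail addresses of past commits that touched the same
files. The sketch below assumes that heuristic; the function names and
the limit of three addresses are arbitrary choices for illustration,
not existing git behaviour:

```python
import subprocess
from collections import Counter


def rank_emails(log_output, limit=3):
    """Return the most frequent e-mail addresses, most active first."""
    counts = Counter(
        line.strip() for line in log_output.splitlines() if line.strip()
    )
    return [email for email, _ in counts.most_common(limit)]


def cc_candidates(paths, limit=3):
    """Suggest cc: addresses for a patch touching the given paths."""
    # %ae prints the author e-mail of every commit touching the paths.
    out = subprocess.run(
        ["git", "log", "--format=%ae", "--"] + list(paths),
        capture_output=True, text=True, check=True,
    ).stdout
    return rank_emails(out, limit)
```

A smarter version could weigh recent commits more heavily, or use
blame on only the lines that the patch actually changes.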

In a broader sense it might be interesting to determine what part
of the code is most actively worked on, and what part of the code is
most stable. This is most interesting when deciding whether an API is
ready to be published. (If the API is changing a lot it might be
better to wait till it has stabilized.) This information could even be
used to find 'edit wars', in which a part of the code is changed over
and over again.
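A first approximation of 'most actively worked on' could simply count
how often each file shows up in the history. This is only a sketch
under that assumption; it works per file rather than per function or
API, and the helper names are invented:

```python
import subprocess
from collections import Counter


def change_counts(name_only_log):
    """Count how many commits touched each path."""
    # The log prints one path per line, with blank lines between commits.
    return Counter(
        line for line in name_only_log.splitlines() if line.strip()
    )


def most_changed(limit=10, since=None):
    """List the most frequently changed paths, optionally since a date."""
    cmd = ["git", "log", "--name-only", "--pretty=format:"]
    if since is not None:
        cmd.append("--since=" + since)
    out = subprocess.run(
        cmd, capture_output=True, text=True, check=True
    ).stdout
    return change_counts(out).most_common(limit)
```

Files at the top of that list are candidates for 'unstable' code,
while files that never appear in a recent window are likely stable.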

My plan for this summer is to create a 'statistics' feature for git.

It would provide the following functionality:
* Show how many commits a specific user made.
* Show the (average) size of their changes (in lines for example).
* Show a 'total diff', that is, take the difference between the source
with, and without their changes, including its size (with for example
a -c switch).
* Show which contributors have contributed to the part of the code
that a patch modifies.
* Show what part of the code a maintainer is working on the most.
* Define an output format for this information that can be used by
other tools (such as gitk and git-web)
* (Optional) Integrate all this information with gitk and git-web.
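To make the first two items and the output format a bit more
concrete, here is a minimal sketch. It assumes the input comes from
'git log --numstat --pretty=format:@%an' (the '@' marker is just a
convention chosen here to recognise author lines) and that JSON is an
acceptable exchange format; neither choice is settled:

```python
import json
from collections import defaultdict


def summarize(log_text):
    """Compute commits and changed lines per author from numstat output."""
    stats = defaultdict(lambda: {"commits": 0, "lines_changed": 0})
    author = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Author line, one per commit.
            author = line[1:]
            stats[author]["commits"] += 1
        elif author is not None and line.count("\t") >= 2:
            # numstat line: "added<TAB>deleted<TAB>path".
            added, deleted, _path = line.split("\t", 2)
            # Binary files are reported as "-" and are skipped here.
            if added.isdigit() and deleted.isdigit():
                stats[author]["lines_changed"] += int(added) + int(deleted)
    return {
        name: dict(s, average=s["lines_changed"] / s["commits"])
        for name, s in stats.items()
    }


def to_json(summary):
    """Serialize the summary so that other tools can read it."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

A front end such as gitk would then only need to read the JSON and
would not have to know how the numbers were gathered.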

Implementation would probably start out with python scripts since
those are easy to modify and combine with other scripts. As milestones
are reached on time, or ahead of time, attention could be shifted to
converting these to C and combining them with the rest of git. When
the other milestones are finished, time could be spent on using the
newly added features in gitk and/or git-web.

To achieve all these milestones, heavy use can be made of existing
git commands. For example, getting the total number of commits from a
maintainer can be achieved with the less-than-intuitive 'git shortlog
-s -n -c master'; providing an alias to this command would make it
easier to use this functionality. Since other git commands will be
used a lot, performance may suffer as a result of piping/parsing
results from one command to another. When a feature is converted to C
later on, attention could be given to directly passing the result from
one function to another.

To determine which users have been active on a file, git's built-in
'blame' functionality can be used. Since git blame is very fast, it
would be no problem to make extensive use of it in determining
maintainer focus. In a similar way it can be used to determine who has
worked on a file recently.
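As an illustration, the blame output could be reduced to 'lines per
author' roughly like this. The sketch assumes a git version whose
blame supports --line-porcelain, which repeats an 'author' header for
every line; the function names are hypothetical:

```python
import subprocess
from collections import Counter


def lines_per_author(porcelain_text):
    """Count surviving lines per author in line-porcelain blame output."""
    counts = Counter()
    for line in porcelain_text.splitlines():
        # Header lines start at column 0; file content is tab-indented,
        # so a source line containing "author " is never mistaken.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_authors(path, rev="HEAD"):
    """Run blame on one file and summarize who wrote its current lines."""
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", rev, "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return lines_per_author(out)
```

Summing these counters over all files in a directory would give a
first estimate of where a given maintainer's focus lies.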

I am a Dutch student, doing my Bachelor's degree at 'Delft University
of Technology'. I study 'Technische Informatica', Dutch for 'Computer
Science'. Even before starting fourth grade in high school I learned
C++ so that I could help out as a coder on a MUD (Multi User Dungeon).
In grades four through six I followed the optional "Informatica" (a
high school version of 'Computer Science') course. We learned Java and
SQL, nothing too difficult, but it got me wanting to learn more. I
learned to pick up other languages on my own, probably the most
valuable thing I learned.

I have used git on numerous projects so far, although I am not yet
familiar with some of its more elaborate features. I have described my
motivation for this particular idea above. Enjoying working with git
made me want to work on it as my Google Summer of Code project.
Knowing that an original idea has more chance of being selected, I
spent a lot of time looking for ways to improve git that are worth a
summer of coding. I'm really looking forward to coding for git and I
think GSoC would be an awesome introduction to its codebase as well as
to contributing to a large project.

Thank you for your time and attention,

Sverre Rabbelier
(SRabbelier on #git)