On Sat, Sep 12, 2009 at 02:30:17PM +0200, Joseph Wakeling wrote:

> I've recently begun contributing to a FOSS project that has a problem --
> although it has extensive git logs (some being CVS/SVN imports) dating
> back over many years, there has not been maintenance of contribution
> records on a file-by-file basis.
>
> I'm trying to rectify this and track down who contributed what.
> Unfortunately while I'm used to basic operations with git, I don't know
> it well enough to be confident in how to go about tracing contributions
> in this way.

We can probably help you with the git side of things, but defining "who
contributed what" is kind of a hairy problem. You will need to define
exactly how you want to count contributions. For example:

> 'git annotate' of course is a nice starting point but of limited use
> because every time someone tweaks a line (and there have been many such
> tweaks in the history of the project) the responsibility of the original
> contributor is replaced by that of the tweaker.

But often the tweaking of the line _does_ make it their own. One of the
metrics often discussed in git is "of the surviving lines in the code,
how many were authored by each person". That is really just the output
of "git blame" (or annotate, which is more or less the same thing). So
people who contribute code that needs a lot of changes or cleanup don't
get as much credit for that code, because their lines got tweaked later.

It's an OK metric if you assume that lines are a good atom of
contribution. That is, if I replace your line, then I remove everything
of value that you added and I should get credit. That is arguably not
the case with something like a style cleanup. Changing:

  for(i = 0; i < n; i++)

to

  for (i = 0; i < n; i++)

to fix whitespace should probably leave authorship with the original
line. But I don't know if you can determine programmatically how
significant a change was.
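
As a rough illustration of the "surviving lines" metric, here is a
minimal sketch that tallies the per-line author names from "git blame
--line-porcelain". The function names (count_authors, blame_counts) and
the ignore_whitespace parameter are made up for this example; the git
flags used (--line-porcelain and -w) are real "git blame" options. It
counts by author name only, so one person committing under two names
shows up twice.

```python
import subprocess
from collections import Counter


def count_authors(porcelain: str) -> Counter:
    """Count surviving lines per author in `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain.splitlines():
        # Every blamed line has exactly one "author <name>" header.
        # Headers such as "author-mail" do not match the trailing space,
        # and file content is tab-prefixed, so it cannot match either.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_counts(path: str, ignore_whitespace: bool = True) -> Counter:
    """Run git blame on one file (hypothetical helper) and tally authors."""
    cmd = ["git", "blame", "--line-porcelain"]
    if ignore_whitespace:
        cmd.append("-w")  # do not let whitespace-only edits take over a line
    cmd += ["--", path]
    result = subprocess.run(
        cmd, capture_output=True, text=True, errors="replace", check=True
    )
    return count_authors(result.stdout)
```

Run from inside a work tree, something like
blame_counts("some/file.c").most_common() would list authors by number
of surviving lines; the path is only a placeholder.
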
In the case of whitespace, "git blame" has an option to ignore
whitespace changes (-w), which probably covers a large portion of such
"trivial change" cases.

> An alternative is to use gitk to trace the history of individual files
> (or paths, as gitk has it). The problem here is that files have been
> renamed, content has been moved about between different files and so on.

You can use rename detection via "git log --follow" and simply count the
lines changed (and by whom) in each commit. This differs from the "git
blame" strategy in that it counts every change as being of value, even
if the line doesn't survive. But no, that won't handle the movement of
some chunk of content from one file to another. Only "git blame" really
looks at code movement on a smaller-than-file level.

> I'm just hoping that the git community can offer some good advice on
> this, to what extent the process of tracing contributions can be
> automated, and so on. I'm not expecting anyone to provide a solution
> for me, but suggestions and pointers in the possible right directions
> would be much appreciated.

I think it is less a git problem and more of a "how do you want to
define contribution" problem. The above is just my thinking about it for
a few minutes.

Sverre Rabelier (cc'd) did a "git stats" GSoC project last year, but I
don't think I ever looked closely at the results or what metrics he came
up with. But that is probably a good direction to look in.

-Peff