On Sat, Mar 22, 2008 at 8:40 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote: > * Figure out which blocs of lines (not necessarily the whole files) relate > to each other by noticing that they are often modified in the same > commit. I've worked with directed graphs before (including writing my own implementation) and have written an algorithm to detect cycles in a graph. I think that this could be done by creating an undirected weighted graph of all files in a commit. If we create a graph that records how many time two files are edited in the same commit, the connection with the highest value would indicate that two files are strongly related. I'm not sure how this could be extrapolated to a section-based approach but a solution to that problem will have to be written anyway. (As with the other featires I'll need to be able to keep track of lines, the mechanism to be developed for that can be used here also.) > * Who are early birds and who are late night owls? Who are day-job > contributors and who are weekenders? Sounds like a 'fun feature', but how about timezones? I'm not sure how commit times are recorded, in UTC, if so, does it also record their timezone? > * Identify "buggy commits" from history, without testing. Zeroth order > approximation is that the lines it introduced were later rewritten by > other later commits, but the later ones are not necessarily fixes but > can be enhancements, so you would need a way to tell which ones are > "fixing commits" and which ones are not. You may want to use project > specific hints to help you doing this: A feature like this would fit well with the other "buggy commit/maintainer detection" but would require a lot of customization. However, considering git already comes with a good customization system it should still be feasible. > - a log that matches /This(?: commit) fixes/ is likely to be a fix; Perhaps a regexp could be configured that marks a commit message as being a fix? > - a commit that touches the same vicinity of another commit after a > short interval is likely to be a fix; Do you mean with "touches the same vicinity " something like "edits code within 5 lines and within 5 commits of a commit x"? > - a commit that is made on 'maint' branch by definition is a fix; Either a list of branches that are maintenance branches or a regexp would be in place again I think. > - a commit that changes test_expect_failure to test_expect_success have > a high probability that it itself is a fix, or it comes soon after a fix; I'm not sure I understand this but that's probably because I'm not yet familiar with git's testing suite. Do you think a general rule to identify changes like this can be made? > * For the integrator, can you spot a pattern like "what he accepts > during weekdays tend to be buggier than what he applies during > weekends"? That would be interesting data, I think a nice graph could be made easily, showing a column for weekdays (or one for each day) and a column for weekends (or one for each day). Each column could then represent the amount of buggy commits / day, or perhaps the ration buggy/enhancements. This histogram could then go back several weeks to give a better picture. Perhaps a line style graph with two lines could be made, one for the weekends and one for the weekdays, or seven lines, one for each day. That way it would be easy to track if the integrator is getting better at his job, or that he is perhaps having a bad/good period lately. > * For each contributor, can you spot a pattern like "his late night > commits are buggier than his early morning commits"? This would be a 'fun feature' again I think, although it could of course be used to decide that 'late night commits' of this contributor should be examined more carefully. > * Can you spot a pattern like "his changes to this area tends to be > buggy but to that area tends to be very good"? This would require connecting commits to area's, that is, track what area's the buggy commits apply to. Maybe instead of tracking this on a commit basis a per-file basis might be more interesting. That is, don't just track if a commit is buggy, but also if a specific change to a certain file is buggy. Doing so would allow for more careful tracking of the area's a developer provides good work in. > * Who tends to introduce more bugs, who tends to do more fixes than > enhancements? The former is an confronting yet interesting statistic, something that could best be presented in a pie chart or such. The latter could be shown as a bar chart in which each bar is divided into three parts 'buggy', 'fixes', and 'enhancements', with one bar per contributor. > * Is their correlation between being a day-job contributor and being > more fixer than bug-introducer? This would require information about whether a contributor has a day job, although this might be inferred from the commit times feature mentioned earlier. It might be nice to have this feature to help decide what kind of work to assign a contributor to (in the case that contributors are assigned a task). The question now though, is which of these features are feasible to do in one GSoC project? That is, which one should be done first, as I want to finishing this feature even if I can't finish it all in three months. Should this be something that is decided in the application already, or should I list all the features and then later on decide (with the aid of the community) which ones to implement first. Thank you for your suggestions, this is starting to be very interesting indeed! Sverre -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html