[Once again I forgot to "reply to all", sorry Jakub ;)]

Heya,

On Tue, Jul 8, 2008 at 2:27 AM, Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> I'd very much like to have (or perhaps to write) some sort of interim
> progress report for Google Summer of Code 2008 projects on
> http://git.or.cz/gitwiki/SoC2008Projects. Therefore I'd like you to
> expand and/or correct those mini-summaries below.

Would you like us to go to the wiki page and edit the summaries
ourselves? If so, I'll see about kicking up something interesting to
put up there soon.

> (It would be, I guess, good preparation for GSoC 2008 mid-term
> evaluations, which according to the GSoC 2008 timeline
> http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_timeline
> are to take place July 7 -- July 14.)

Agreed.

> 2. git-statistics (?)
>
> Student: Sverre Rabbelier
> Mentor: David Symonds
>
> There were some posts about how git-statistics can be used:
> http://thread.gmane.org/gmane.comp.version-control.git/81534
> http://thread.gmane.org/gmane.comp.version-control.git/82027
> but it was mainly a theoretical thread.
>
> The git-stats repository at http://repo.or.cz/w/git-stats.git (gitweb)
> has some commits, but I don't remember any of them on the git mailing
> list. Not ready?

I am not yet sure how I should send in my code for peer review.

$ git rev-list master | wc -l
151

A bunch of these commits introduce significant changes, a few are
fixes, and some only touch the changelog [0], which is actually a
pretty good way to see what I've been up to. Credit here goes to David
for suggesting I keep one.

I am contemplating sending in one patch per module, since I've
structured my code in such a way that each module is mostly a
stand-alone file that can be reviewed on its own. I am not sure whether
there is a good way to get such code reviewed; most other projects had
several distinct steps (e.g., modifying parts of existing code first)
to go through before they could write new code. My project has been
"new code" from the get-go, so any advice on how to send this in for
peer review is very welcome.

I have been working on GitStats, an attempt to gather metrics on a git
repository. For those who would like to know more about my goals,
please have a look at [1] or the more general [2]. Currently I have
quite a few metrics done; most of them are aimed at detecting whether a
commit is a bugfix. A short listing of the metrics done so far:

* stats.py author -d: Shows the file activity of a specific developer,
  measured in how often they modified each file and the total lines
  added/removed (much like a diffstat, but for a specific developer
  instead of a single commit).
* stats.py author -f: Shows the activity on a specific file, measured
  in how often it was modified; could be extended to also count line
  changes like "author -d".
* stats.py branch -b: Shows which branch a specific commit belongs to;
  for more information on this metric see below.
* stats.py commit -v: Shows all commits that are reverted by the
  specified commit; will be extended to allow detection of partial
  reverts.
* stats.py diff -e: Shows whether two specific commits introduce the
  same changes.
* stats.py diff -e -n: Ditto, but ignores the content of the changes
  and only looks at which lines were changed.
* stats.py diff -e -i: Ditto, but inverts the comparison: instead of
  comparing additions with additions and deletions with deletions, the
  additions of the first diff are compared with the deletions of the
  second diff, and vice versa. This way a revert is easily detected
  (see the sketch after this list).
* stats.py index -t: Shows which commits touched the same paths as the
  staged changes.
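To make the inverted comparison a bit more concrete, here is a minimal
Python sketch of the idea. This is not the actual GitStats code; the
helper names and the unordered line-set comparison are illustrative
assumptions only.

import subprocess

def diff_lines(commit):
    """Return (added, removed) line sets for a commit's diff against
    its first parent; file headers are skipped."""
    out = subprocess.Popen(['git', 'diff', commit + '^', commit],
                           stdout=subprocess.PIPE).communicate()[0]
    added, removed = set(), set()
    for line in out.decode('utf-8', 'replace').splitlines():
        if line.startswith('+') and not line.startswith('+++'):
            added.add(line[1:])
        elif line.startswith('-') and not line.startswith('---'):
            removed.add(line[1:])
    return added, removed

def looks_like_revert(first, second):
    """Inverted comparison: additions of 'first' against deletions of
    'second', and vice versa; a full match suggests a revert."""
    a_add, a_rem = diff_lines(first)
    b_add, b_rem = diff_lines(second)
    return a_add == b_rem and a_rem == b_add

A real check has to be more careful than comparing unordered sets of
lines, of course, but the swap of additions and deletions is the core
of the trick.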
I am now working on getting the 'is a bugfix' metric going. Its
sub-metrics are mostly done ('branch contains', 'is revert', 'commit
msg matches', 'commit diff matches'), with the exception of a few
simple ones ('is partial revert', 'modifies same lines') that are
almost done as well. As a sneak peek into what I've been up to, I'll
zoom in a bit on the development of the 'belongs to' metric.

One of the hardest things to tackle was the 'belongs to' metric. Its
goal is to determine how much a certain commit 'belongs to' a specific
branch; when aggregating this metric one can pick the best fit. This is
useful when trying to figure out whether a commit was made on a
maintenance branch, and as such whether it should be marked as a 'fix'.

The definition of "belonging to" a branch is as follows: each branch
that contains the target commit starts with a 'dilution' of 0, and for
each merge on the way between the branch tip and the commit, the
dilution is increased by one if the path enters the merge through a
parent other than the first one. This means that if a commit was
"merged into" a branch instead of having been made on that branch, it
will have a higher dilution than if the commit was made on that branch
directly. This way, the lower the dilution, the more a commit belongs
on a branch.

The problem with this metric was not writing it, but getting it to not
be terribly slow on large repositories. A few of the major
optimizations include:

* Do not call git-rev-list on each commit, but instead gather all
  'parentage' information beforehand. This avoids a lot of calls to the
  git binary, which is expensive. This simple optimization resulted in
  a huge speed increase, even on the small tests. With this in place it
  runs in under a second for the test repository, but the git.git repo
  still takes ages.
* Do not revisit commits that have already been checked _and had a
  lower or equal dilution_ in the previous check. Especially in the
  git.git repository this optimization made the algorithm go from
  'unusable' to useful. With this in place it runs in under a second
  for recent commits, and takes under 20 seconds for the first commit
  in the repository.
* Keep a global memory too; that is, when multiple branches contain the
  commit, also remember across branches which commits we have already
  visited. When visiting a commit we check whether we already visited
  it with a lower dilution; if so, another branch is always going to be
  better than the current branch from that point on, so we stop walking
  there. With this in place it still runs in under a second for recent
  commits, and takes about 7 seconds for the first commit in the
  repository.
* Do a single rev-list covering all the branches we are interested in,
  instead of one rev-list per branch. The per-branch rev-lists were
  only needed to cut down the output by not listing 'subsets'; dropping
  that filtering saves a lot of time. With this in place it still runs
  in under a second for recent commits, and takes a little over 5
  seconds for the first commit in the repository.

The above timings were made on a rather outdated version of git.git;
after updating to today's latest, the timings are as follows:

$ time stats.py branch -b e83c516331
Matching branches:
pu
next
master
maint
offcuts

real    0m6.360s
user    0m6.228s
sys     0m0.244s

It is plausible that there is no way to do this any faster with the
current approach: walking all commits, examining them, and calculating
the dilution all the way down to the root commit (in the git.git repo)
simply takes that long. The algorithm itself is almost instant; most of
the time is spent waiting for git rev-list to return.
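To make the walk described above a bit more concrete, here is a rough
Python sketch of the per-branch part, assuming the parent information
is gathered up front with a single 'git rev-list --parents' run. The
function names are illustrative and this is not the actual GitStats
code; the cross-branch 'global memory' is left out.

import subprocess

def gather_parents(*refs):
    """Build a sha1 -> [parent sha1s] map with one rev-list call."""
    out = subprocess.Popen(['git', 'rev-list', '--parents'] + list(refs),
                           stdout=subprocess.PIPE).communicate()[0]
    parents = {}
    for line in out.decode('utf-8', 'replace').splitlines():
        shas = line.split()
        parents[shas[0]] = shas[1:]
    return parents

def dilution(target, branch_tip, parents):
    """Return the lowest dilution of 'target' as seen from 'branch_tip',
    or None if the branch does not contain the commit."""
    best = {}                     # sha1 -> lowest dilution seen so far
    queue = [(branch_tip, 0)]
    result = None
    while queue:
        commit, dil = queue.pop(0)
        # Do not revisit commits already reached with an equal or lower
        # dilution; this path cannot do better from here on.
        if commit in best and best[commit] <= dil:
            continue
        best[commit] = dil
        if commit == target:
            if result is None or dil < result:
                result = dil
            continue
        for i, parent in enumerate(parents.get(commit, [])):
            # Entering a merge through a non-first parent means the
            # commit was merged in rather than made on this line of
            # development, so the dilution goes up by one.
            queue.append((parent, dil + (1 if i > 0 else 0)))
    return result

Both the target and the branch tip have to be full sha1s here (e.g.
from 'git rev-parse'), since that is what rev-list prints. Running this
once for every branch that contains the commit and picking the lowest
result gives the best-fitting branch; the global memory mentioned above
then lets later branches stop walking at commits that an earlier branch
already reached with a lower dilution.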
For a recent commit on the maint branch we find the following
information (the -d flag adds 'debug' information, so that we can see
what else was found but left out of the regular report):

sverre@Laptop-Sverre:~/code/git$ time stats.py branch -b 2b2828b -d
Retreiving branches that contain the commit...
Gathering information on the history...
Done. Checking branches now:
origin/next
master
origin/maint
origin/pu
Done. Listing found metrics:
Branch next, dilution: 2.
Branch master, dilution: 1.
Branch maint, dilution: 0.
Done. Matching branches:
The minimal dilution is: 0
maint

real    0m6.431s
user    0m6.164s
sys     0m0.260s

From the above one can see that master has recently merged in that
commit, but that next did not include it until after it merged in
master. In this way a commit can cascade through multiple merges, each
merge increasing its dilution by one.

I am very interested to hear comments on my progress so far, but also
on what is thought to be "important to work on next". It would be
awesome if a few people could give it a test drive. I recommend using
the setupRepo.py script in 'src/scripts', which creates a 'metricsrepo'
in /tmp that is well suited for experimenting with the 'belongs to'
metric. The 'testrepo', also created in /tmp, is better suited for
testing some of the other metrics. Both repositories are used by the
test cases under 'src/t', which should all pass :).

My repository can be found at [3]. Thank you for reading; I'm looking
forward to your reviews and comments.

[0] http://repo.or.cz/w/git-stats.git?a=blob;f=doc/changelog.txt
[1] http://alturin.googlepages.com/Use_cases.html
[2] http://alturin.googlepages.com/gsoc2008
[3] http://repo.or.cz/w/git-stats.git

--
Cheers,

Sverre Rabbelier
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html