On Sat, Mar 12, 2016 at 07:00:26PM +0800, Kai Hendry wrote: > I penned a script to plot SLOC of a git project using GNUplot & I > thought the fastest way to count code fluctuations was via `git show > --numstat`. > > However that requires some awk counting of the lines: > https://github.com/kaihendry/graphsloc/blob/5f31e388e9b655e1801f13885f4311d221663a19/collect-stats.sh#L32 > > Is there a better way I missed? I think there is bug since my graph was > a factor of 10 out whilst graphing Linux: > https://twitter.com/kaihendry/status/706627679924174848 I think you'll always need to post-process the --numstat output to count up lines. But that's fairly minor. The bigger problem, I think is in how you handle merges. Imagine I have two branches, each of which touch the same code: git init echo base >file git add file git commit -m base echo master >>file git commit -am master git checkout -b side HEAD^ echo side >>file git commit -am side and now I merge them, resolving the conflict favor of one side: git merge master { echo base; echo master; } >file git commit -am resolved What does --numstat say? $ git log --cc --oneline --numstat 989c6f7 resolved 1 1 file b9bbaf9 side 1 0 file 087b294 master 1 0 file 09037ef base 1 0 file If we add these up, it looks like 3 lines were added. But the end result has only 2 lines! We double-counted the additions on the two branches, even though one stomped on the other. And then the merge resolution looks neutral (one line gone, one line added), even though it was where we did the stomping. To be honest, I am not sure _what_ the "--cc --numstat" is showing there (I added --cc because that is used by "git show", which is what your script uses). The actual "--cc" (and "-c") patches show nothing, which is right. It kind of looks like we are just showing the diffstat against the first parent. I'm not sure anyone ever designed what a "combined" diffstat would look like. Let's redo our merge and resolve in favor of "side": git reset --hard HEAD^ git merge master { echo base; echo side; } >file git commit -am resolved Now the numstat is blank. I think it _is_ just showing the first-parent diffstat. So that means each merge is introducing errors into your count, and you'll drift away from accurate. Another, related problem, is that you are adding up numbers along multiple simultaneous branches. So going back to my output above, at the point we read the "master" commit, we might say that the total sloc-count is 2. That's correct. And then we read "side", and say there are 3 lines. But that's not right. It doesn't build on master, so we still have only 2 lines. So even _if_ we merged them and the end result had 3 lines (or if we somehow accounted for the double-counting when examining the merge), you'd have inaccuracies through your dataset. Another way of thinking about it is that your graph wants to represent a single linear history, with the sloc-count changing as time moves to the right. But that's not what really happened; at any given time, there are _several_ sloc-counts, depending on which branch you're following. But that can also give us a clue about one solution[1]. For your purposes, you don't care about hitting every commit. You just want a bunch of linear samples of the form [timestamp, sloc] to feed to gnuplot. We can use "--first-parent" to walk _a_ linear history and see a strict progression. And then use "-m" to tell git to just show merges as the diff against that first parent (i.e., summarizing everything that happened along the side-branch we are not following). Like (this is back on the "we resolved as master" version of my example, to illustrate how the merge is shown): $ git log --first-parent -m --numstat --oneline 4244c8a resolved 1 1 file b9bbaf9 side 1 0 file 09037ef base 1 0 file And that count is right. We had one line in our base, one on the "side" commit, and then the merge didn't change our sloc-count at all (we dropped the "side" line in favor of "master"). Finally, I'll note one other thing in my examples above. Note that I used a single "git log" invocation, whereas your script reads "rev-list" output to start a series of N "git show" invocations. I'll bet that took quite a long time to run on the kernel. :) Doing it all in one git-log means you avoid the process startup overhead, and I'd expect it to run ~50 times as fast (at least that was what I saw from a quick experiment on a much smaller repo). -Peff [1] The other solution I thought of was to actually count the SLOC of each tree at each commit, rather than worrying about diffs. You can make it efficient by caching the SLOC of blob and tree sha1s, so you don't count the same files over and over. That gives you accurate SLOC-counts at any given point in time, but doesn't address the multiple-branches thing (so you'd see your SLOC bounce up and down in time as you saw commits for various branches). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html