On Sat, 27 Jan 2007, Shawn O. Pearce wrote: > > _THIS_ is worth doing. I've been having a lot of discussion on > #git with Simon 'corecode' Schubert and Chris Lee about how poorly > git-blame performs compared to its counterpart in Subversion. I think we're *much* better off trying to get people off the "git-blame" mentality entirely. Don't screw up git in trying to make "git blame" performance better. Because you *will* screw it up. No ifs, buts, maybes or anything else about it. The only way to make "git blame" perform better is to do a really really crappy job. You really have to fundamnetally realize that the reason SVN and CVS can do "annotate" really cheaply is BECAUSE THEY ARE BROKEN. Trying to emulate them is only going to break git too. Really. Let that thought sink in, and let it fester in your brain until you really really get it. There simply is no way to get "git blame" to be faster, without screwing things up royally in any number of ways: - extra (redundant) on-the-fly-generated metadata Yes, you can generate caches etc. Now you need to have a way to check the caches, and to make sure that they are in sync. You also need to update them constantly - you can make them look great in *benchmarks*, but I bet that once you actually start developing, and really want to _use_ them, you'll just curse the whole thing, because the caches won't be there for the things you want. People normally don't do a million "blame" operations on the same tree - they do *one*. Caches don't work. - extra (redundant) metadata generated at commit time Instead of doing caches, you can do it statically at commit time, and now you will screw up all the *other* things, like finding content movement between files, and a dense and efficient repository encoding. You'll need to add tons of crap to the commits (that most ops won't find *any* use for), and you'll also make the operation a lot less useful because it's now static rather than dynamic. > Thoughts? Here's a really fundamental suggestion: Instead of trying to do "git blame" faster, which is a totally broken notion, just face the fact that emulating a broken environment will be slower. CPU's will get faster and help you in the long run, but more importantly, if you just "accept git". So instead, aim for: - "yes, we can do blame, but yes, it will take five seconds for a biggish archive on a reasonable CPU". Which implies that with a slow CPU, and a really humongous archive, it will take much more. Is five seconds slow enough that people think it is slow? Yes. Is 30 seconds approaching painful? Yes. But you should try to aim for really just telling people that it's not a common operation. For example, I think it is a mistake to expose blame in "gitweb". It's simply not a natural operation for git to do. Don't do it. (Side note: for the kernel - which is certainly not a "small" project, even if it's not a humongous one either, and the kinds of machines I work with, git blame usually takes about 1-2 seconds for most files. That is *not* excessive for a developer. It's excessive if you try to do it on a web-server where you've made everybody and his dog press "history" by putting a big red blinking button there.. In other words, aim for "git blame" being something that you run once a week (which is about as often as I do it) or maybe a couple of times a day if you're really obsessed. At which point a few seconds isn't that horrid. - teach people about alternatives. For example, "git log -p filename" is actually a hell of a lot more useful for most things. Yes, it's *different*, but git makes it really easy, and it has the added advantage that you see things in time order, and can very naturally search back through time. In a very similar vein, the real problem with "git blame" is not that git cannot do it, but the fact that it's a "whole history in one go" operation. Again, you can actually do a "git blame" that people would probably find much less annoying, if you just did things *incrementally*. The reason "git log -p filename" doesn't perform badly is exactly that it is incremental. Try it some time. The cost of time git log -p mm/memory.c > /dev/null time git blame mm/memory.c > /dev/null is almost 100% identical when you run them that way. So why is it that just about everybody would always consider "git log -p" to be instantaneous with git, but "git blame" is slow? I'd really like people to think about that difference between "git log -p filename" and "git blame filename". Because it tells you a lot about the *psychology* of the thing. They both take the same amount of time, but one is slow as hell, and the other one is so fast that anybody coming from the CVS world will just go "Whoah! Magic!". Really. Think about it. Now, think about what would happen if you had a graphical "git blame" that was a tcl/tk thing (and slowed things down even more), but basically filled in the "git blame" information incrementally - the exact same way that "git blame" actually calculates it internally? You know what? I bet that people would LIKE it. They could open up the file in that nice graphical interface, and scroll down/search to the part they care about, and see how git populates the blame. They'd think it's *cool*. And it would feel fast, because there wouldn't be any need to wait for *all* the information before it's done. Here's a patch. Use "git blame --incremental filename" to get the blame output in a nicely parseable format that you can now write a simple graphical viewer for. Please. Linus --- diff --git a/builtin-blame.c b/builtin-blame.c index 4a1accf..7d97ae9 100644 --- a/builtin-blame.c +++ b/builtin-blame.c @@ -27,6 +27,7 @@ static char blame_usage[] = " -p, --porcelain Show in a format designed for machine consumption\n" " -L n,m Process only line range n,m, counting from 1\n" " -M, -C Find line movements within and across files\n" +" --incremental Show blame entries as we find them, incrementally\n" " -S revs-file Use revisions from revs-file instead of calling git-rev-list\n"; static int longest_file; @@ -36,6 +37,7 @@ static int max_digits; static int max_score_digits; static int show_root; static int blank_boundary; +static int incremental; #ifndef DEBUG #define DEBUG 0 @@ -1069,6 +1071,21 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt) origin_decref(parent_origin[i]); } +static void found_guilty_entry(struct blame_entry *ent) +{ + if (ent->guilty) + return; + ent->guilty = 1; + if (incremental) { + struct origin *origin = ent->suspect; + printf("%d %d %s:%s:%d\n", + ent->lno, ent->num_lines, + sha1_to_hex(origin->commit->object.sha1), + origin->path, + ent->s_lno); + } +} + static void assign_blame(struct scoreboard *sb, struct rev_info *revs, int opt) { while (1) { @@ -1102,7 +1119,7 @@ static void assign_blame(struct scoreboard *sb, struct rev_info *revs, int opt) /* Take responsibility for the remaining entries */ for (ent = sb->ent; ent; ent = ent->next) if (!cmp_suspect(ent->suspect, suspect)) - ent->guilty = 1; + found_guilty_entry(ent); origin_decref(suspect); if (DEBUG) /* sanity */ @@ -1717,6 +1734,8 @@ int cmd_blame(int argc, const char **argv, const char *prefix) die("More than one '-L n,m' option given"); bottomtop = arg; } + else if (!strcmp("--incremental", arg)) + incremental = 1; else if (!strcmp("--score-debug", arg)) output_option |= OUTPUT_SHOW_SCORE; else if (!strcmp("-f", arg) || @@ -1907,6 +1926,9 @@ int cmd_blame(int argc, const char **argv, const char *prefix) assign_blame(&sb, &revs, opt); + if (incremental) + return 0; + coalesce(&sb); if (!(output_option & OUTPUT_PORCELAIN)) - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html