On Tue, Jul 31, 2012 at 01:20:42PM -0700, Junio C Hamano wrote: > Jeff King <peff@xxxxxxxx> writes: > > > A much better hint is to annotate pairs of sha1s, to say "do not bother > > doing inexact rename correlation on this pair; I promise that they have > > value N". > > Surely. And I suspect that the patch to the current codebase to do > so would be much less impact if you go that way. Yes. You may remember I wrote a generic caching subsystem last summer when we were talking about caching commit generations. Plugging in a new map type to map sha1 pairs to 32-bit integers was pretty simple, and that gives the basis for a rename cache. It's fairly unimpressive on git.git. My best-of-five for "git log --format=%H --raw -M" went from 5.83s to 5.74s, which is pretty much within the run-to-run noise. The resulting cache was 155K. However, it's easy to come up with much more pathological cases. I have a really ugly rename-and-tweak-tags commit on my photo repository, and those blobs are relatively big. My timings for "git show" on that were: before: 49.724s after, run 1: 54.904s after, run 2: 0.117s Which is pretty darn nice. The resulting cache is 5.3M (the repository itself is in the gigabytes, but that's not really relevant; the cache will obviously scale with the number of paths, not with the size of the blobs). It would also work for copies, too, of course. Here are the results of "git log --format=%H --raw -M -C -C" on git.git: before: 1m35s after, run 1: 39.7s after, run 2: 39.5s So it does make much more of a difference for copies, which is obvious; git is doing a lot more work for us to cache. At the same time, our cache is much bigger: 32M. Yikes. My cache is fairly naive, in that it literally stores 44 bytes of <src_sha1, dst_sha1, score> for each entry. At the cost of more complexity, you could store each src_sha1 once, followed by a set of <dst_sha1, score> pairs. I also didn't take any special care to avoid duplicates of <X, Y> and <Y, X> (since presumably these renames would be commutative). I'm not sure it is necessary, though; I think the copy machinery already suppresses this when entries are in both source and destination lists. So I don't know. It can definitely speed up some operations, but at the cost of a non-trivial cache on disk. I'll spare you all of the generic caching infrastructure, but the actual patch to rename looks like this (just to give a sense of where the hooks go): diff --git a/diffcore-rename.c b/diffcore-rename.c index 216a7a4..db70878 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -6,6 +6,7 @@ #include "diffcore.h" #include "hash.h" #include "progress.h" +#include "metadata-cache.h" /* Table of rename/copy destinations */ @@ -137,7 +138,8 @@ static int estimate_similarity(struct diff_filespec *src, */ unsigned long max_size, delta_size, base_size, src_copied, literal_added; unsigned long delta_limit; - int score; + uint32_t score; + struct sha1pair pair; /* We deal only with regular files. Symlink renames are handled * only when they are exact matches --- in other words, no edits @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src, if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE) return 0; + hashcpy(pair.one, src->sha1); + hashcpy(pair.two, dst->sha1); + if (rename_cache_get(&pair, &score)) + return score; + if (!src->cnt_data && diff_populate_filespec(src, 0)) return 0; if (!dst->cnt_data && diff_populate_filespec(dst, 0)) @@ -195,6 +202,7 @@ static int estimate_similarity(struct diff_filespec *src, score = 0; /* should not happen */ else score = (int)(src_copied * MAX_SCORE / max_size); + rename_cache_set(&pair, &score); return score; } -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html