This is my first stab at faster rename handling based on Andy's code. The
patches are on top of next (to get Linus' recent work on exact renames).
Most of the interesting stuff is in 2/3.

  1/3: extension of hash interface
  2/3: similarity detection code
  3/3: integrate similarity detection into diffcore-rename

The implementation is pretty basic, so I think there is room for code
optimization (50% of the time is spent in hash lookups, so we might be
able to micro-optimize that) as well as algorithmic improvements (like
the sampling Andy mentioned).

With these patches, I can get my monster binary diff down from about 2
minutes to 17 seconds. And comparing all of linux-2.4 to all of linux-2.6
(similar to Andy's previous demo) takes about 10 seconds.

There are a few downsides:

  - the current implementation tends to give lower similarity values
    compared to the old code (see discussion in 2/3), but this should be
    tweakable

  - on large datasets, it's more memory hungry than the old code because
    the hash grows very large. This can be helped by bumping up the
    binary chunk size (actually, the 17 seconds quoted above is using
    256-byte chunks rather than 64-byte -- with 64-byte chunks, it's more
    like 24 seconds) as well as sampling.

  - no improvement on smaller datasets. Running "git-whatchanged -M --raw
    -l0" on the linux-2.6 repo takes about the same time with the old and
    new code (presumably the algorithmic savings of the new code are lost
    in a higher constant factor, so when n is small, it is a wash).

-Peff
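
For anyone who wants to play with the chunk-size trade-off without
reading the patches, here is a rough Python sketch of the general idea of
chunk-based similarity. It is not the code from 2/3: the function names,
the fixed-offset chunking, and the "shared chunks / larger chunk count"
score are all assumptions made for illustration.

```python
from collections import Counter


def chunk_fingerprints(data: bytes, chunk_size: int = 64) -> Counter:
    """Count how often each fixed-size chunk occurs in data.

    Assumption for this sketch: the table is keyed on the chunk bytes
    themselves so results are exact. An implementation that cares about
    memory would store a fixed-width hash of each chunk instead.
    """
    counts = Counter()
    for start in range(0, len(data), chunk_size):
        counts[data[start:start + chunk_size]] += 1
    return counts


def similarity(src: bytes, dst: bytes, chunk_size: int = 64) -> float:
    """Return a score in [0, 1]: shared chunks over the larger chunk count."""
    if not src and not dst:
        return 1.0
    a = chunk_fingerprints(src, chunk_size)
    b = chunk_fingerprints(dst, chunk_size)
    # Counter & Counter is a multiset intersection (minimum of each count).
    shared = sum((a & b).values())
    total = max(sum(a.values()), sum(b.values()))
    return shared / total


if __name__ == "__main__":
    old = b"A" * 64 + b"B" * 64 + b"C" * 64 + b"D" * 64
    new = b"A" * 64 + b"B" * 64 + b"C" * 64 + b"X" * 64
    for size in (64, 256):
        entries = len(chunk_fingerprints(old, size))
        print(size, entries, similarity(old, new, size))
```

In this toy, going from 64-byte to 256-byte chunks shrinks the table
from four entries to one, but a change confined to one 64-byte region
then makes the whole 256-byte chunk differ, so the score drops. Cutting
chunks at fixed offsets also means an insertion near the start of a file
shifts every later chunk; the sketch makes no attempt to handle that.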