[PATCH/RFC 0/3] faster inexact rename handling

Jeff King <peff@xxxxxxxx> · Tue, 30 Oct 2007 00:21:19 -0400

This is my first stab at faster rename handling based on Andy's code.
The patches are on top of next (to get Linus' recent work on exact
renames). Most of the interesting stuff is in 2/3.

  1/3: extension of hash interface
  2/3: similarity detection code
  3/3: integrate similarity detection into diffcore-rename

The implementation is pretty basic, so I think there is room for
code optimization (50% of the time is spent in hash lookups, so we might
be able to micro-optimize that) as well as algorithmic improvements (like the
sampling Andy mentioned).

With these patches, I can get my monster binary diff down from about 2
minutes to 17 seconds. And comparing all of linux-2.4 to all of
linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

There are a few downsides:
  - the current implementation tends to give lower similarity values
    compared to the old code (see discussion in 2/3), but this should be
    tweakable
  - on large datasets, it's more memory hungry than the old code because
    the hash grows very large. This can be helped by bumping up the
    binary chunk size (actually, the 17 seconds quoted above is using
    256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
    more like 24 seconds) as well as sampling.
  - no improvement on smaller datasets. Running "git-whatchanged -M
    --raw -l0" on the linux-2.6 repo takes about the same time with the
    old and new code (presumably the algorithmic savings of the new code
    are lost in a higher constant factor, so when n is small, it is a
    wash).

-Peff
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html