Re: [PATCH/RFC 0/3] faster inexact rename handling

Sorry I have been AWOL... I was going to try to work on this, but I
got abjectly sick (long story).  But it's great to see this out.


On 10/29/07, Jeff King <peff@xxxxxxxx> wrote:
> This is my first stab at faster rename handling based on Andy's code.
> The patches are on top of next (to get Linus' recent work on exact
> renames). Most of the interesting stuff is in 2/3.
>
>   1/3: extension of hash interface
>   2/3: similarity detection code
>   3/3: integrate similarity detection into diffcore-rename
>
> The implementation is pretty basic, so I think there is room for
> code optimization (50% of the time is spent in hash lookups, so we might
> be able to micro-optimize that) as well as algorithmic improvements (like the
> sampling Andy mentioned).

For micro-optimization, I was thinking that the hash tables could be
implemented without a pointer (or a memory allocation) per value, so
everything sits in one contiguous block of memory.  In C++ you can do
this trivially by declaring a small struct as the second template
parameter of the container; in C I guess you can simulate it with a
macro or something, roughly like the sketch below.
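
Something along these lines -- purely illustrative, the names
(DECLARE_FLAT_HASH, flat_lookup, etc.) are made up here and are not
part of git's actual hash API:

  /* One contiguous array of slots with the value embedded in the
   * slot, so there is a single allocation and no pointer chasing
   * per entry. */
  #define DECLARE_FLAT_HASH(name, value_type)                           \
          struct name##_slot { unsigned int hash; value_type value; };  \
          struct name { struct name##_slot *slots; unsigned int size; }

  /* Example instantiation and lookup with linear probing.  hash == 0
   * is reserved to mean "empty slot"; size is a power of two. */
  DECLARE_FLAT_HASH(chunk_table, int);

  static struct chunk_table_slot *flat_lookup(struct chunk_table *t,
                                              unsigned int hash)
  {
          unsigned int i = hash & (t->size - 1);
          while (t->slots[i].hash && t->slots[i].hash != hash)
                  i = (i + 1) & (t->size - 1);
          return &t->slots[i];    /* either the match or an empty slot */
  }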

For the inverted indexing step, the values in the hash are going to be
quite small, especially if line_threshold=1.  Then you only need two
integers for the left side and two for the right side, i.e. four
integers per value.  The integers could just be indexes into the lists
(like the current code uses); see the sketch below.
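
Again just to illustrate the size, not the real layout (the field
names are made up):

  /* Four ints per hashed line: which file and which line on each
   * side.  With line_threshold=1 there is at most one occurrence
   * per side, so this is the whole value. */
  struct line_occurrence {
          int src_file, src_line;   /* left side */
          int dst_file, dst_line;   /* right side */
  };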

For the count matrix step, the values are just going to be integers,
so storing the matrix right in the hash table makes sense, for example:
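
Something like this (made-up names again, just a sketch of the idea):

  /* The "matrix" is really a sparse map from a (src, dst) file pair
   * to the number of line hashes the two files share.  The entry is
   * small enough to live directly in the table slot. */
  struct pair_count {
          int src_file;    /* key: index of the left-side file  */
          int dst_file;    /* key: index of the right-side file */
          int shared;      /* value: count of common line hashes */
  };

The similarity score for a pair can then be derived from "shared"
relative to the total number of hashes in each of the two files.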

The sampling should only be necessary for binary files, I think.


> With these patches, I can get my monster binary diff down from about 2
> minutes to 17 seconds. And comparing all of linux-2.4 to all of
> linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

Hopefully that is close to the cost of just reading the files off
disk.  The algorithm itself should take only a fraction of the time
that simply reading the files does, and a git diff presumably has to
read them anyway.

I was timing that by comparing it to doing a "| xargs wc -l" on the
lists of files.


> There are a few downsides:
>   - the current implementation tends to give lower similarity values
>     compared to the old code (see discussion in 2/3), but this should be
>     tweakable
>   - on large datasets, it's more memory hungry than the old code because
>     the hash grows very large. This can be helped by bumping up the
>     binary chunk size (actually, the 17 seconds quoted above is using
>     256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
>     more like 24 seconds) as well as sampling.
>   - no improvement on smaller datasets. Running "git-whatchanged -M
>     --raw -l0" on the linux-2.6 repo takes about the same time with the
>     old and new code (presumably the algorithmic savings of the new code
>     are lost in a higher constant factor, so when n is small, it is a
>     wash).

I think the old code tries to respect the cache as much as possible,
from what I can tell.  The new code has to use hash tables, whose
access patterns are of course cache-unfriendly.  For smaller data sets,
though, I would expect the hash table to fit in cache.  What's your
definition of small here?  Are you sure the old code isn't triggering
one of the size limits that were already there?

thanks,
Andy
