Re: git-diff-tree -M performance regression in 'next'

Junio C Hamano <junkio@xxxxxxx> · Sun, 12 Mar 2006 20:14:21 -0800

Linus Torvalds <torvalds@xxxxxxxx> writes:

> However, it worries me a bit that you don't check the source strings, 
> especially since the hash you use seems somewhat suspect (why limit it to 
> essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime 
> that fits in the "hashval" thing, which is at least 32 bits? Also, 
> shouldn't you make that spanhash thing use a "unsigned int" instead of 
> "unsigned long"?)
>
> So I would suggest instead the hash function to be:
>
> 	typedef unsigned long long u64;
>
> 	/* Biggest prime in 32 bits */
> 	#define HASHVAL (4294967291u)

Actually what you are seeing is the best compromise I've come up
with.  There were about a dozen crappoid that turned out to be
suboptimal I threw away.

The hashsize is an interesting example of such a compromise
between precision and performance.  It is true that full 32-bit
hash gives us more precise match detection.  If you change the
current hash function to divide with (4294967291u), the rename
similarity score assigned to some (but not all) paths in the
example dataset we have been using (diff between v2.6.12 and
v2.6.14) decreases by about 1% -- this comes from the fact that
the hashvalue capped to 16-bit gives some false hits that larger
hashbase value can tell apart.  Because of it, some paths near
the rename detection threshold stop being recognized as renames.

However, the runtime of the same dataset increases quite a bit
when we do this.  I think this is because we need to keep more
different hash values in the hashtable and the cost to randomly
access into a huge table starts to hurt, unlike the 16-bit
capped version whose hash table never needs to grow beyond 65k
entries.

        next    39.77user 0.22system 0:40.78elapsed
                0inputs+0outputs (0major+18855minor)

        32-bit  66.32user 0.15system 1:07.00elapsed
                0inputs+0outputs (0major+21057minor)

Admittedly, we should not optimize for one particular test case,
but we should start from somewhere.

Decreasing the hashsize to 14-bit or less had noticeable
degradation on the quality of renames the algorithm detects and
misses to detect, both in v2.6.12..v2.6.14 test and some tests
in git.git repository.

One obvious change we could do is to make the hashsize to
0x1800D (a prime halfway between 16- and 17-bit); currently the
hashtable grows double every time when the table of the current
size fills up sufficiently, but the current hashbase cannot fit
in 16-bit, so we _are_ already using a table with 128K entries
in some cases.

-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html