Linus Torvalds <torvalds@xxxxxxxx> writes: > However, it worries me a bit that you don't check the source strings, > especially since the hash you use seems somewhat suspect (why limit it to > essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime > that fits in the "hashval" thing, which is at least 32 bits? Also, > shouldn't you make that spanhash thing use a "unsigned int" instead of > "unsigned long"?) > > So I would suggest instead the hash function to be: > > typedef unsigned long long u64; > > /* Biggest prime in 32 bits */ > #define HASHVAL (4294967291u) Actually what you are seeing is the best compromise I've come up with. There were about a dozen crappoid that turned out to be suboptimal I threw away. The hashsize is an interesting example of such a compromise between precision and performance. It is true that full 32-bit hash gives us more precise match detection. If you change the current hash function to divide with (4294967291u), the rename similarity score assigned to some (but not all) paths in the example dataset we have been using (diff between v2.6.12 and v2.6.14) decreases by about 1% -- this comes from the fact that the hashvalue capped to 16-bit gives some false hits that larger hashbase value can tell apart. Because of it, some paths near the rename detection threshold stop being recognized as renames. However, the runtime of the same dataset increases quite a bit when we do this. I think this is because we need to keep more different hash values in the hashtable and the cost to randomly access into a huge table starts to hurt, unlike the 16-bit capped version whose hash table never needs to grow beyond 65k entries. next 39.77user 0.22system 0:40.78elapsed 0inputs+0outputs (0major+18855minor) 32-bit 66.32user 0.15system 1:07.00elapsed 0inputs+0outputs (0major+21057minor) Admittedly, we should not optimize for one particular test case, but we should start from somewhere. Decreasing the hashsize to 14-bit or less had noticeable degradation on the quality of renames the algorithm detects and misses to detect, both in v2.6.12..v2.6.14 test and some tests in git.git repository. One obvious change we could do is to make the hashsize to 0x1800D (a prime halfway between 16- and 17-bit); currently the hashtable grows double every time when the table of the current size fills up sufficiently, but the current hashbase cannot fit in 16-bit, so we _are_ already using a table with 128K entries in some cases. - : send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html