On Thu, Oct 26, 2017 at 07:43:19PM +0200, René Scharfe wrote: > Am 25.10.2017 um 20:49 schrieb Stefan Beller: > > The implementations in diff.c to detect moved lines needs to compare > > strings and hash strings, which is implemented in that file, as well > > as in the xdiff library. > > > > Remove the rather recent implementation in diff.c and rely on the well > > exercised code in the xdiff lib. > > > > With this change the hash used for bucketing the strings for the moved > > line detection changes from FNV32 (that is provided via the hashmaps > > memhash) to DJB2 (which is used internally in xdiff). Benchmarks found > > on the web[1] do not indicate that these hashes are different in > > performance for readable strings. > > > > [1] https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed > > Awesome comparison! It calls the variant used in libxdiff DJB2a (which > uses xor to mix in the new char) instead of DJB2 (which uses addition). > > There's also https://www.strchr.com/hash_functions, which lists DJB2 > as Bernstein. Both functions rank somewhere in the middle of that list. FWIW, I did some experiments with Murmur3 and SipHash a while back, but I don't think I came up with anything faster than the existing code. OTOH, moving to SipHash gives us the ability to randomize the hashes, which can resist some DoS attacks (although as I've said before, computing arbitrary diffs for untrusted strangers is pretty much a DoS-in-a-box). Anyway, I rebased them and ran p4000[1], with does show some results: Test origin jk/xdl-murmur3-wip jk/xdl-siphash-wip ------------------------------------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.05(0.05+0.00) 0.05(0.05+0.00) +0.0% 0.05(0.05+0.00) +0.0% 4000.2: log --raw -3000 (tree-only) 0.30(0.28+0.02) 0.30(0.28+0.01) +0.0% 0.30(0.28+0.02) +0.0% 4000.3: log -p -3000 (Myers) 2.06(1.98+0.08) 1.90(1.85+0.05) -7.8% 2.32(2.25+0.07) +12.6% 4000.4: log -p -3000 --histogram 2.50(2.42+0.07) 2.25(2.21+0.04) -10.0% 2.70(2.62+0.08) +8.0% 4000.5: log -p -3000 --patience 2.58(2.52+0.06) 2.47(2.42+0.04) -4.3% 2.86(2.77+0.08) +10.9% So it looks like murmur3 is indeed a little faster. SipHash is slower, which is too bad (because the randomization would be nice). I _think_ back then I compared to XDL_FAST_HASH, which was a little faster even than murmur3 (but generated too many collisions, which led to bad behavior for certain cases). Anyway, those branches are at https://github.com/peff/git if anybody wants to look further. They probably need some cleanup. At the very least, we'd probably need to teach the whitespace-ignoring hash function to use the same algorithm. -Peff [1] Actually, I added "--no-merges" to each command in p4000. It seems silly as it is, since the point is to compute a bunch of diffs.