Re: [PATCH 2/2] diff.c: get rid of duplicate implementation

Jeff King <peff@xxxxxxxx> · Mon, 30 Oct 2017 10:59:41 -0700

On Thu, Oct 26, 2017 at 07:43:19PM +0200, René Scharfe wrote:

> Am 25.10.2017 um 20:49 schrieb Stefan Beller:
> > The implementations in diff.c to detect moved lines needs to compare
> > strings and hash strings, which is implemented in that file, as well
> > as in the xdiff library.
> > 
> > Remove the rather recent implementation in diff.c and rely on the well
> > exercised code in the xdiff lib.
> > 
> > With this change the hash used for bucketing the strings for the moved
> > line detection changes from FNV32 (that is provided via the hashmaps
> > memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
> > on the web[1] do not indicate that these hashes are different in
> > performance for readable strings.
> > 
> > [1] https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
> 
> Awesome comparison!  It calls the variant used in libxdiff DJB2a (which
> uses xor to mix in the new char) instead of DJB2 (which uses addition).
> 
> There's also https://www.strchr.com/hash_functions, which lists DJB2
> as Bernstein.  Both functions rank somewhere in the middle of that list.

FWIW, I did some experiments with Murmur3 and SipHash a while back, but
I don't think I came up with anything faster than the existing code.
OTOH, moving to SipHash gives us the ability to randomize the hashes,
which can resist some DoS attacks (although as I've said before,
computing arbitrary diffs for untrusted strangers is pretty much a
DoS-in-a-box).

Anyway, I rebased them and ran p4000[1], with does show some results:

Test                                  origin            jk/xdl-murmur3-wip       jk/xdl-siphash-wip    
-------------------------------------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.05(0.05+0.00)   0.05(0.05+0.00) +0.0%    0.05(0.05+0.00) +0.0% 
4000.2: log --raw -3000 (tree-only)   0.30(0.28+0.02)   0.30(0.28+0.01) +0.0%    0.30(0.28+0.02) +0.0% 
4000.3: log -p -3000 (Myers)          2.06(1.98+0.08)   1.90(1.85+0.05) -7.8%    2.32(2.25+0.07) +12.6%
4000.4: log -p -3000 --histogram      2.50(2.42+0.07)   2.25(2.21+0.04) -10.0%   2.70(2.62+0.08) +8.0% 
4000.5: log -p -3000 --patience       2.58(2.52+0.06)   2.47(2.42+0.04) -4.3%    2.86(2.77+0.08) +10.9%

So it looks like murmur3 is indeed a little faster. SipHash is slower,
which is too bad (because the randomization would be nice). I _think_
back then I compared to XDL_FAST_HASH, which was a little faster even
than murmur3 (but generated too many collisions, which led to bad
behavior for certain cases).

Anyway, those branches are at https://github.com/peff/git if anybody
wants to look further. They probably need some cleanup. At the very
least, we'd probably need to teach the whitespace-ignoring hash function
to use the same algorithm.

-Peff

[1] Actually, I added "--no-merges" to each command in p4000. It seems
    silly as it is, since the point is to compute a bunch of diffs.