Bad behavior in xhistogram.c in the face of hash collisions?

"Greg Hurrell" <greg@xxxxxxxxxxx> · Mon, 16 Aug 2021 14:40:27 +0200

Hi,

I think I may have found a bug in the histogram diff algorithm that
manifests when there is a hash collision.  This behavior seems to exist
in the JGit implimentation (https://git.io/J0Ud8) too, and was brought
across with the port-to-C in 8c912eea94a2.

In the following, any line numbers refer to xdiff/xhistogram.c as it
exists in 5d213e46bb (2.33-rc2), as seen here: https://git.io/J0UHS

There are two phases in the algorithm which look up slots in the
histogram hash table based on the value of a hash function:

1. During the backwards scan in `scanA`, each line in the old sequence
   is considered and hashes are used to assign items to histogram
   buckets.
2. During the forwards scan of the new sequence we call `try_lcs` with
   each line in turn and look up the corresponding histogram bucket.

Now, the original JGit code comments suggests the intended behavior and
purpose of the buckets is as follows:

> To prevent the algorithm from having an O(N^2) running time, an upper
> limit on the number of unique elements in a histogram bucket is
> configured by `setMaxChainLength(int)`. If sequence A has more than
> this many elements that hash into the same hash bucket, the algorithm
> passes the region to `setFallbackAlgorithm(DiffAlgorithm)`. If no
> fallback algorithm is configured, the region is emitted as a replace
> edit.

But when I look at both the JGit and Git implementations, this is what I
actually see:

During the backwards scan we select a hash table bucket (at line 115)
based on the value of the item:

    tbl_idx = TABLE_HASH(index, 1, ptr);

If there is a chain in the corresponding bucket and the values match
(ie. hash to the same hash code and are equal) we prepend to the front
of the chain (lines 121-133).

In the event of a hash collision, we proceed to the next item in the
chain (line 135):

    rec = rec->next;

and check for a match. By definition, this check is always going
to fail, because only identical elements ever get prepended onto
chains. So this chain traversal serves only to measure the count of
items in the chain, something that we could have done in constant time
by looking up `rec->cnt` anyway.

After reaching the end of the chain and exiting the loop, provided we
didn't exceed the maximum chain length, we now create a new chain,
overwriting the original occupant of the slot:

    *rec_chain = rec;

So, during `scanA` we'll effectively destroy chains whenever there is a
hash collision, which doesn't seem to be the intent of the original
algorithm.

In the second half of the algorithm, `try_lcs`, we have the inverse
problem. On line 165 we again look up the hash table slot:

    struct record *rec = index->records[TABLE_HASH(index, 2, b_ptr)];

On line 169 we start a loop that will consider each record in the chain:

    for (; rec; rec = rec->next) {

Due to the construction of `scanA`, we know that every item in this
chain must be identical, which in turn means that in the event of a hash
collision we are doomed to traverse the entire chain without finding a
match:

    if (!CMP(index, 1, as, 2, b_ptr))
           continue;

That is at best unnecessary work, but also means that whenever we have
collisions we will have a set of "orphaned" records that aren't actually
reachable from any chain in a hash table slot that we will look up, and
that doesn't seem to match the intended behavior of the algorithm, if I
understand it correctly.

Is this a big deal in practice? I suspect the reason we haven't noticed
it is that:

1.  Hash collisions are sufficiently rare (although, via the birthday
    paradox, not _that_ rare, especially because we use the next power
    of two to determine the number of slots in our hash table; ie. a
    200-line file would have a 256-slot hash-table associated with it);
    and:
2.  Their consequences tend not to produce obviously broken diffs,
    because the algorithm is just used to select split points around
    which to apply itself recursively; we still produce a valid diff
    even if we don't select the "optimal" split point.

Anyway, I just wanted to gut-check this analysis with the list to see
whether it sounds right or not. The bug is pretty benign as far as I can
tell, but still probably worth fixing, if it is a bug.

Cheers,
Greg