Hi,

I think I may have found a bug in the histogram diff algorithm that manifests when there is a hash collision. This behavior seems to exist in the JGit implementation (https://git.io/J0Ud8) too, and was brought across with the port to C in 8c912eea94a2.

In the following, any line numbers refer to xdiff/xhistogram.c as it exists in 5d213e46bb (2.33-rc2), as seen here: https://git.io/J0UHS

There are two phases in the algorithm which look up slots in the histogram hash table based on the value of a hash function:

1. During the backwards scan in `scanA`, each line in the old sequence is considered and hashes are used to assign items to histogram buckets.

2. During the forwards scan of the new sequence, we call `try_lcs` with each line in turn and look up the corresponding histogram bucket.

Now, the original JGit code comments suggest that the intended behavior and purpose of the buckets is as follows:

> To prevent the algorithm from having an O(N^2) running time, an upper
> limit on the number of unique elements in a histogram bucket is
> configured by `setMaxChainLength(int)`. If sequence A has more than
> this many elements that hash into the same hash bucket, the algorithm
> passes the region to `setFallbackAlgorithm(DiffAlgorithm)`. If no
> fallback algorithm is configured, the region is emitted as a replace
> edit.

But when I look at both the JGit and Git implementations, this is what I actually see.

During the backwards scan we select a hash table bucket (at line 115) based on the value of the item:

    tbl_idx = TABLE_HASH(index, 1, ptr);

If there is already a chain in the corresponding bucket and the values match (i.e. they hash to the same code and compare equal), we prepend the item to the front of the chain (lines 121-133). In the event of a hash collision, we proceed to the next item in the chain (line 135):

    rec = rec->next;

and check it for a match. By definition, this check is always going to fail, because only identical elements ever get prepended onto chains. So this chain traversal serves only to measure the number of items in the chain, something we could have read in constant time from `rec->cnt` anyway.

After reaching the end of the chain and exiting the loop, provided we didn't exceed the maximum chain length, we create a new chain, overwriting the original occupant of the slot:

    *rec_chain = rec;

So, during `scanA`, we'll effectively destroy chains whenever there is a hash collision, which doesn't seem to be the intent of the original algorithm.

In the second half of the algorithm, `try_lcs`, we have the inverse problem. On line 165 we again look up the hash table slot:

    struct record *rec = index->records[TABLE_HASH(index, 2, b_ptr)];

On line 169 we start a loop that will consider each record in the chain:

    for (; rec; rec = rec->next) {

Due to the construction of `scanA`, we know that every item in this chain must be identical, which in turn means that in the event of a hash collision we are doomed to traverse the entire chain without finding a match:

    if (!CMP(index, 1, as, 2, b_ptr))
        continue;

That is at best unnecessary work, but it also means that whenever we have collisions we end up with a set of "orphaned" records that aren't actually reachable from any chain in any hash table slot we will look up, and that doesn't seem to match the intended behavior of the algorithm, if I understand it correctly.
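To make the failure mode concrete, here is a stand-alone toy that models the insertion logic as I've described it above. To be clear, this is my own sketch using simplified, hypothetical structures and a deliberately bad hash, not the actual xdiff code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * A record heads a chain of identical lines; each hash table slot
     * holds a single record pointer, analogous to scanA's rec_chain.
     */
    struct record {
        const char *line;
        unsigned int cnt;
        struct record *next;
    };

    #define TABLE_SIZE 4 /* tiny on purpose, to force a collision */

    static struct record *table[TABLE_SIZE];

    /* Deliberately weak hash so that distinct lines collide. */
    static unsigned int toy_hash(const char *line)
    {
        return strlen(line) % TABLE_SIZE;
    }

    static void scan_line(const char *line)
    {
        unsigned int slot = toy_hash(line);
        struct record *rec = table[slot];

        while (rec) {
            if (!strcmp(rec->line, line)) {
                rec->cnt++; /* identical element: bump the count */
                return;
            }
            rec = rec->next; /* collision: walk on, but never match */
        }

        /*
         * New chain head. Per my reading of `*rec_chain = rec;`, this
         * displaces the previous occupant of the slot instead of
         * linking to it.
         */
        rec = calloc(1, sizeof(*rec));
        rec->line = line;
        rec->cnt = 1;
        table[slot] = rec;
    }

    int main(void)
    {
        struct record *rec;

        scan_line("foo"); /* "foo" and "bar" are the same length, */
        scan_line("bar"); /* so they collide under toy_hash       */

        /* Only "bar" is found; the "foo" record is now orphaned. */
        for (rec = table[toy_hash("foo")]; rec; rec = rec->next)
            printf("reachable: %s (cnt = %u)\n", rec->line, rec->cnt);

        return 0;
    }

Running this prints only "reachable: bar (cnt = 1)": the record for "foo" still exists but can no longer be found from its slot, which is exactly the orphaning that `try_lcs` would then run into.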
Is this a big deal in practice? I suspect the reason we haven't noticed it is that:

1. Hash collisions are sufficiently rare (although, via the birthday paradox, not _that_ rare, especially because we use the next power of two to determine the number of slots in our hash table; i.e. a 200-line file would have a 256-slot hash table associated with it; see the P.S. below for the arithmetic); and:

2. Their consequences tend not to produce obviously broken diffs, because the algorithm is just used to select split points around which to apply itself recursively; we still produce a valid diff even if we don't select the "optimal" split point.

Anyway, I just wanted to gut-check this analysis with the list to see whether it sounds right or not. The bug is pretty benign as far as I can tell, but still probably worth fixing, if it is a bug.

Cheers,
Greg
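P.S. Here's the back-of-the-envelope arithmetic behind point 1 above. This is just my own sketch of the standard birthday approximation, not anything from xdiff: with n distinct values hashed uniformly into m slots, we expect about n(n-1)/(2m) colliding pairs, and the probability of at least one collision is roughly 1 - exp(-n(n-1)/(2m)).

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 200.0; /* lines in the file, assuming all distinct */
        double m = 256.0; /* next power of two >= n, per the sizing above */
        double pairs = n * (n - 1.0) / (2.0 * m);

        printf("expected colliding pairs: %.1f\n", pairs); /* ~77.7 */
        printf("P(at least one collision): %.6f\n", 1.0 - exp(-pairs));
        return 0;
    }

Built with `cc birthday.c -lm` (the file name is just for illustration), this reports roughly 78 expected colliding pairs and a collision probability of 1.0 to six decimal places, so for a file of even modest size the collision path is anything but exotic.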