On Mon, 13 Mar 2006, Junio C Hamano wrote:
>
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.

Yes. However, there are better ways to do that in practice.

The most effective way that is generally used is to not use a fixed
chunk-size, but use a terminating character, together with a
minimum/maximum chunksize.

There's a pretty natural terminating character that works well for
sources: '\n'.

So the natural way to do similarity detection when most of the code is
line-based is to do the hashing on chunks that follow the rule "minimum of
<n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".

So if you don't see any '\n' at all (or the only such one is less than <n>
bytes into your current window), do the hash over a <2n>-byte chunk (this
takes care of binaries and/or long lines).

This - for source code - allows you to ignore trivial byte offset things,
because you have a character that is used for synchronization. So you
don't need to do hashing at every byte in both files - you end up doing
the hashing only at line boundaries in practice. And it still _works_ for
binary files, although you effectively need bigger identical chunk-sizes
to find similarities (for text-files, it finds similarities of size <n>,
for binaries the similarities need to effectively be of size 3*n, because
you chunk it up at ~2*n, and only generate the hash at certain offsets in
the source binary).

		Linus
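
To make the chunking rule above concrete, here is a rough sketch in Python. It is not the diffcore-delta code; the names split_chunks, chunk_hashes and shared_bytes are made up for illustration, zlib.crc32 just stands in for whatever hash you would really use, and "at least <n> bytes" is read here as "the chunk, including its trailing '\n', is at least n bytes long".

```python
import zlib
from collections import Counter


def split_chunks(data: bytes, n: int) -> list[bytes]:
    """Split data into chunks of n..2n bytes, ending at '\n' when possible.

    Hypothetical helper: a chunk ends just after the first newline that
    makes it at least n bytes long; if no such newline shows up within
    2n bytes, the chunk is cut at 2n bytes (binaries, long lines).
    The final chunk may be shorter than n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    chunks = []
    pos = 0
    while pos < len(data):
        # A newline at index i gives a chunk of length i - pos + 1,
        # so only indices in [pos + n - 1, pos + 2n - 1] qualify.
        nl = data.find(b"\n", pos + n - 1, pos + 2 * n)
        end = nl + 1 if nl != -1 else min(pos + 2 * n, len(data))
        chunks.append(data[pos:end])
        pos = end
    return chunks


def chunk_hashes(data: bytes, n: int) -> Counter:
    """Map each chunk hash to the number of bytes it covers in data."""
    counts = Counter()
    for chunk in split_chunks(data, n):
        counts[zlib.crc32(chunk)] += len(chunk)  # crc32 is a stand-in hash
    return counts


def shared_bytes(src: bytes, dst: bytes, n: int) -> int:
    """Estimate how many bytes of src reappear in dst, chunk by chunk."""
    a, b = chunk_hashes(src, n), chunk_hashes(dst, n)
    return sum(min(a[h], b[h]) for h in a.keys() & b.keys())


if __name__ == "__main__":
    src = b"int a = 1;\nint b = 2;\nreturn a+b;\n"
    dst = b"/* new */\n" + src  # one line inserted at the top
    print(split_chunks(src, 8))
    print(shared_bytes(src, dst, 8), "of", len(src), "bytes matched")
```

With lines between n and 2n bytes long, every chunk is exactly one line, so inserting a line at the top of the file shifts all byte offsets without disturbing any of the other chunk hashes. Input with no newlines simply falls back to 2n-byte chunks.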