Linus Torvalds <torvalds@xxxxxxxx> writes:

> Mine is a bit less hacky than yours, I believe. It doesn't skip
> whitespace, instead it just maintains a rolling 64-bit number, where each
> character shifts it left by 7 and then adds in the new character value
> (overflow in 32 bits just ignored).

That rolling register is a good idea.

The "whitespace hack" was done to recognize certain kinds of
changes that commonly appear in source code. For example, it
will still recognize content copies after you re-indent your
code, or add an "if (...) {" and "} else { ... }" around an
existing code block, or add extra blank lines.

It is still an inadequate hack. If you comment out a code block
by adding "#if 0" and "#endif" around it, it notices the
surviving lines, but if instead you comment out a block by
prefixing "//" in front of every line in the block, neither your
64-byte-or-EOL nor my extended line algorithm would notice the
content copy anymore.

Anyway, I did a bit of comparison and it appears that the
whitespace thing does not make much difference in practice.

> It's fast and stupid, but doesn't seem to do any worse than your old one.

Comparing "next" with your 64-byte-or-EOL and my "extended line"
on the v2.6.12..v2.6.14 test case shows:

                                  64-or-EOL    extended line
    renames identically detected        108              110
    matched differently                   2                2
    finds what "next" misses              4                4
    misses what "next" finds             23               21

What they find seems reasonable. What they reject is sometimes
debatable. For example, the similarity between these two files
does not seem to be noticed by either:

    v2.6.12/drivers/media/dvb/dibusb/dvb-dibusb-firmware.c
    v2.6.14/drivers/media/dvb/dvb-usb/dvb-usb-firmware.c

The "next" algorithm gives this pair a 60% score, while these two
give it 45% or so.

But they both reject this bogus "rename" that the "next"
algorithm finds:

    v2.6.12/drivers/char/drm/gamma_drv.c
    v2.6.14/drivers/char/drm/via_verifier.h

("next" 51% vs 37-40% with these algorithms).
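To make the two chunking schemes discussed above concrete, here is a
minimal Python sketch. It is an illustration only, not the code from
either patch. The names chunk_hashes, similarity and skip_whitespace
are hypothetical. The 64-bit mask is a literal reading of the quoted
description, and scoring by "bytes in matching chunks over the size of
the larger input" is an assumption made for this example.

```python
from collections import Counter

MASK64 = (1 << 64) - 1   # assumption: 64-bit register, overflow dropped
WHITESPACE = b" \t\r\n"

def chunk_hashes(data: bytes, skip_whitespace: bool = False) -> Counter:
    """Map each chunk's register value to the number of bytes hashed into it.

    A chunk ends at a newline or after 64 hashed bytes. With
    skip_whitespace=True, whitespace never enters the register, so
    re-indented or blank-line-padded text yields the same chunks.
    """
    counts = Counter()
    acc = 0   # rolling register
    n = 0     # bytes hashed into the current chunk
    for byte in data:
        is_eol = byte == 0x0A
        if skip_whitespace and byte in WHITESPACE:
            # Whitespace is not hashed, but a newline still closes the chunk.
            if is_eol and n:
                counts[acc] += n
                acc, n = 0, 0
            continue
        acc = ((acc << 7) + byte) & MASK64   # shift left by 7, add the byte
        n += 1
        if is_eol or n == 64:
            counts[acc] += n
            acc, n = 0, 0
    if n:
        counts[acc] += n   # trailing chunk without a newline
    return counts

def similarity(src: bytes, dst: bytes, skip_whitespace: bool = False) -> float:
    """Bytes in chunks common to both inputs, relative to the larger input."""
    a = chunk_hashes(src, skip_whitespace)
    b = chunk_hashes(dst, skip_whitespace)
    common = sum(min(n, b[h]) for h, n in a.items())
    total = max(sum(a.values()), sum(b.values()))
    return common / total if total else 1.0

original   = b"x = 1;\ny = 2;\n"
reindented = b"  x = 1;\n  y = 2;\n"
commented  = b"//x = 1;\n//y = 2;\n"

print(similarity(original, reindented))                       # 0.0
print(similarity(original, reindented, skip_whitespace=True)) # 1.0
print(similarity(original, commented, skip_whitespace=True))  # 0.0
```

Because bits shifted past 64 are dropped, only the last nine or so
bytes of a long chunk influence its value in this literal reading; the
short lines in the usage example avoid that effect. The last call shows
the "//" case: every chunk changes, so nothing matches even with
whitespace skipped.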
> Anyway, I don't think something like this is really any good for rename
> detection, but it might be good for deciding whether to do a real delta.

Either algorithm seems to have a non-negligible false negative
rate, but their false positive rates are reasonably low. So we
could use these as a pre-filter and use the real delta only on
pairs that these quick and dirty algorithms say are too
different.
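The pre-filter idea could look roughly like the following Python
sketch. Everything named here is an assumption for illustration:
cheap_score is a hypothetical stand-in for either quick algorithm,
difflib.SequenceMatcher stands in for the real delta, and the 0.5
threshold is arbitrary. Since the quick check rarely gives false
positives, a pair it accepts is taken as similar without further work;
only pairs it rejects pay for the expensive comparison.

```python
import difflib

def cheap_score(src: str, dst: str) -> float:
    """Hypothetical quick estimate: share of non-blank lines in common,
    compared after stripping leading and trailing whitespace."""
    a = {line.strip() for line in src.splitlines() if line.strip()}
    b = {line.strip() for line in dst.splitlines() if line.strip()}
    if not a and not b:
        return 1.0
    return len(a & b) / max(len(a), len(b))

def real_score(src: str, dst: str) -> float:
    """Stand-in for the real delta: character-level matching ratio."""
    return difflib.SequenceMatcher(None, src, dst).ratio()

def is_similar(src: str, dst: str, threshold: float = 0.5):
    """Return (decision, stage), where stage names the check that decided."""
    if cheap_score(src, dst) >= threshold:
        return True, "quick"          # trusted: false positives are rare
    # The quick check may have missed a real copy, so fall back.
    return real_score(src, dst) >= threshold, "delta"

old        = "a = 1\nb = 2\nc = 3\n"
reindented = "    a = 1\n    b = 2\n    c = 3\n"
commented  = "//a = 1\n//b = 2\n//c = 3\n"
unrelated  = "zzzz\nqqqq\n"

print(is_similar(old, reindented))  # (True, 'quick')
print(is_similar(old, commented))   # (True, 'delta')
print(is_similar(old, unrelated))   # (False, 'delta')
```

The second call is the "//" case: the quick check sees no common line,
and the fallback recovers the match.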