Re: git-diff-tree -M performance regression in 'next'

Junio C Hamano <junkio@xxxxxxx> · Tue, 14 Mar 2006 02:26:22 -0800

Linus Torvalds <torvalds@xxxxxxxx> writes:

> Mine is a bit less hacky than yours, I believe. It doesn't skip 
> whitespace, instead it just maintains a rolling 64-bit number, where each 
> character shifts it left by 7 and then adds in the new character value 
> (overflow in 32 bits just ignored).

That rolling register is a good idea.  The "whitespace hack" was
done to recognize certain kinds of changes that commonly appear
in source code.  For example, it will still recognize content
copies after you re-indent your code, or add an "if (...) {" and
"} else { ... }" around an existing code block, or add extra
blank lines.
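
To make the two ideas concrete, here is a rough sketch, not code
from either patch.  It hashes each chunk with the shift-by-7
rolling register and ends a chunk at a newline or after 64
hashed bytes.  The function name, the output array, and the
skip_ws flag are all made up for illustration; skip_ws only
loosely approximates the whitespace hack, and the 64-bit
register width is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_MAX 64	/* a chunk ends after this many hashed bytes */

/*
 * Hypothetical helper, not code from either patch.  Stores one
 * hash per chunk in out[] (at most max_out of them) and returns
 * how many were stored.  With skip_ws set, blanks, tabs and
 * newlines are not fed to the hash, and chunks that end up with
 * no hashed bytes (e.g. blank lines) are dropped.
 */
size_t hash_chunks(const unsigned char *buf, size_t len, int skip_ws,
		   uint64_t *out, size_t max_out)
{
	uint64_t hash = 0;
	size_t used = 0;	/* bytes hashed into the current chunk */
	size_t n = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char c = buf[i];
		int is_ws = (c == ' ' || c == '\t' || c == '\n');

		if (!skip_ws || !is_ws) {
			/* rolling register: unsigned overflow just wraps */
			hash = (hash << 7) + c;
			used++;
		}
		if (c == '\n' || used == CHUNK_MAX) {
			if (used && n < max_out)
				out[n++] = hash;
			hash = 0;
			used = 0;
		}
	}
	if (used && n < max_out)
		out[n++] = hash;	/* trailing chunk without a newline */
	return n;
}
```

With skip_ws set, "foo();" and a re-indented "foo();" followed
by a blank line yield the same single hash; with skip_ws clear,
the blank line becomes a chunk of its own.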

It is still an inadequate hack.  If you comment out a code block
by adding "#if 0" and "#endif" around it, it notices the
surviving lines, but if instead you comment out a block by
prefixing "//" in front of every line in the block, neither your
64-byte-or-EOL nor my extended line algorithm would notice the
content copy anymore.
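
The difference between the two ways of commenting out can be
seen with a much simpler stand-in for the chunk hashes: count
the lines of the old text that still occur, modulo indentation,
as whole lines of the new text.  This is only an illustration
under that assumption; surviving_lines() and its helpers are
hypothetical names, and the quadratic line-by-line comparison
is not what either algorithm does.

```c
#include <stddef.h>
#include <string.h>

/* Return the end of the line starting at p (its '\n' or the NUL). */
static const char *line_end(const char *p)
{
	const char *eol = strchr(p, '\n');
	return eol ? eol : p + strlen(p);
}

/* Skip leading blanks and tabs within [p, end). */
static const char *skip_indent(const char *p, const char *end)
{
	while (p < end && (*p == ' ' || *p == '\t'))
		p++;
	return p;
}

/* Does the n-byte line occur, modulo indentation, in text? */
static int has_line(const char *text, const char *line, size_t n)
{
	const char *p = text;

	while (*p) {
		const char *end = line_end(p);
		const char *s = skip_indent(p, end);

		if ((size_t)(end - s) == n && !memcmp(s, line, n))
			return 1;
		p = *end ? end + 1 : end;
	}
	return 0;
}

/* Count non-blank lines of src that survive as lines of dst. */
size_t surviving_lines(const char *src, const char *dst)
{
	size_t count = 0;
	const char *p = src;

	while (*p) {
		const char *end = line_end(p);
		const char *s = skip_indent(p, end);

		if (end > s && has_line(dst, s, (size_t)(end - s)))
			count++;
		p = *end ? end + 1 : end;
	}
	return count;
}
```

Wrapping a two-line block in "#if 0" / "#endif" leaves both
lines surviving; prefixing each line with "// " leaves none,
because every line now differs from its original.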

Anyway, I did a bit of comparison and it appears that the
whitespace thing does not make much difference in practice.

> It's fast and stupid, but doesn't seem to do any worse than your old one. 

Comparing "next" with your 64-byte-or-EOL and my "extended
line" algorithms on the v2.6.12..v2.6.14 test case shows:

                                64-or-EOL       extended line
renames identically detected    108             110
matched differently             2               2
finds what "next" misses        4               4
misses what "next" finds        23              21

What they find seems reasonable.  What they reject is sometimes
debatable.  For example, the similarity between these two files
does not seem to be noticed by either.

        v2.6.12/drivers/media/dvb/dibusb/dvb-dibusb-firmware.c
        v2.6.14/drivers/media/dvb/dvb-usb/dvb-usb-firmware.c

The "next" algorithm gives this pair a 60% score, while these
two give it 45% or so.

But they both reject this bogus "rename" that the "next"
algorithm finds:

	v2.6.12/drivers/char/drm/gamma_drv.c
	v2.6.14/drivers/char/drm/via_verifier.h

("next" 51% vs 37-40% with these algorithms). 

> Anyway, I don't think something like this is really any good for rename 
> detection, but it might be good for deciding whether to do a real delta.

Both algorithms seem to have non-negligible false negative
rates, but their false positive rates are reasonably low.  So we
could use them as a pre-filter and run the real delta only on
pairs that these quick and dirty algorithms say are too
different.
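
A minimal sketch of that pre-filter arrangement might look like
this.  Everything here is hypothetical: score_fn, score_pair(),
the 0..100 score range, and the two toy scorers exist only so
the control flow can be shown and exercised; they stand in for
the quick hash comparison and the real delta.

```c
#include <string.h>

/* Hypothetical scorer type: returns a similarity score, 0..100. */
typedef int (*score_fn)(const char *a, const char *b);

/*
 * Trust the cheap scorer when it says "similar" (false positives
 * are rare), and fall back to the exact scorer when it says "too
 * different" (that may be a false negative).
 */
int score_pair(const char *a, const char *b,
	       score_fn cheap, score_fn exact, int min_score)
{
	int s = cheap(a, b);

	if (s >= min_score)
		return s;
	return exact(a, b);
}

/* Toy stand-ins so the sketch can be exercised. */
int exact_calls;	/* how often the expensive path ran */

int toy_cheap(const char *a, const char *b)
{
	return strcmp(a, b) ? 0 : 100;
}

int toy_exact(const char *a, const char *b)
{
	exact_calls++;
	return strlen(a) == strlen(b) ? 80 : 10;
}
```

With these toys, an identical pair is accepted without touching
toy_exact(), while a pair the cheap scorer rejects is rescored
by it.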
