Re: Wrong damage counting in diffcore_count_changes?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Fri, 4 Dec 2009, Junio C Hamano wrote:
> 
> Well, the current loop structure largely comes from your eb4d0e3 (optimize
> diffcore-delta by sorting hash entries., 2007-10-02) so you would be the
> best judge of the change ;-), even though it seems that the current code
> inherited the "skipping of dst when src does not exist" bug from c06c796
> (diffcore-rename: somewhat optimized., 2006-03-12).

Yeah, I think the sorting thing tried to not change any logic, so the 
counting predates that whole thing.

> But my lunchbreak is over, and my evening is booked, so I unfortunately
> cannot spend more time thinking about any possible fallouts from this
> change until tomorrow.
> 
> Sorry, and thanks.

No problem. Just an example of the fallout here on the kernel:

 - totally made-up trivial differences in two different kernel 
   directories:

	[torvalds@nehalem linux]$ git diff -p --stat >
	 kernel/sched.c |    1 +
	 mm/memory.c    |    1 +
	 2 files changed, 2 insertions(+), 0 deletions(-)
	
	diff --git a/kernel/sched.c b/kernel/sched.c
	index 3c11ae0..7a86f4f 100644
	--- a/kernel/sched.c
	+++ b/kernel/sched.c
	@@ -1,3 +1,4 @@
	+123
	 /*
	  *  kernel/sched.c
	  *
	diff --git a/mm/memory.c b/mm/memory.c
	index 6ab19dd..0de58a6 100644
	--- a/mm/memory.c
	+++ b/mm/memory.c
	@@ -1,3 +1,4 @@
	+aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
	 /*
	  *  linux/mm/memory.c
	  *

 - Here's what current git reports (which is obviously garbage):

	[torvalds@nehalem linux]$ git diff --dirstat
	 100.0% kernel/

 - Here's what a fixed git with that patch reports:

	[torvalds@nehalem linux]$ ~/git/git diff --dirstat
	   8.6% kernel/
	  91.3% mm/

and notice how the fixed diff now sees the change to mm/memory.c as a real 
change (it used to dismiss it entirely because it was a new previously 
non-existent pattern, so the hash didn't exist in the source), and gives 
reasonable percentages as to how much damage has been done (ie the 
mm/memory.c changes were obviously bigger: 42 new characters vs 4 new 
characters).

So the patch definitely improves dirstat, although for a rather made-up 
example (I normally use it for much bigger diffs, where the impact of this 
bug is not nearly as noticeable).

But that patch also changes the end result (in major ways) for real 
examples too - probably exactly because it undercounted additions of new 
code. I can't prove that the new numbers are "better", but I think they 
are:

 - old and presumably broken:

	[torvalds@nehalem linux]$ git diff -M --dirstat --cumulative v2.6.32-rc8..v2.6.32
	   5.8% arch/
	   3.5% drivers/net/e1000e/
	   9.6% drivers/net/
	   3.9% drivers/staging/
	  34.8% drivers/
	   4.1% fs/cachefiles/
	  24.9% fs/fscache/
	  32.6% fs/
	  14.5% kernel/
	   3.7% net/

 - new and hopefully fixed:

	[torvalds@nehalem linux]$ ~/git/git diff -M --dirstat --cumulative v2.6.32-rc8..v2.6.32
	   3.3% Documentation/filesystems/caching/
	   3.5% Documentation/filesystems/
	   7.6% Documentation/
	   3.2% arch/arm/
	   4.2% arch/blackfin/
	  10.1% arch/
	   3.1% drivers/gpu/drm/
	   7.8% drivers/net/
	   3.1% drivers/staging/
	  30.0% drivers/
	   5.6% fs/cachefiles/
	  19.9% fs/fscache/
	  28.4% fs/
	   3.1% include/
	  12.9% kernel/

so it really does seem like the old code is crud. It just never really 
mattered, because from a "is this a copy" standpoint, we don't care all 
that much about the "added" content, we care mostly about the original 
size and how much of it still remains.

(Sanity check: the diffstat for that thing says:

 277 files changed, 4426 insertions(+), 1244 deletions(-)

and diffstat for just Documentation/ says

 5 files changed, 289 insertions(+), 14 deletions(-)

so you'd expect that the Documentation changes would be at _least_ 
(289+14)/(4426+1244), ie ~5%, and since text documentation lines tend to 
be more dense than actual code (with lines with just curly braces etc), I 
do think that 7.6% Documentation/ sounds much more likely than <3% 
(invisible).

I also did a "git log -M --summary" on the current kernel git tree, and it 
didn't actually change any of _that_. So it seems that rename detection 
still ends up spitting out the same numbers.

Which is actually not surprising, because rename detection doesn't even 
end up _using_ the "literal_added" part at all (it just does:

	score = (int)(src_copied * MAX_SCORE / max_size);

ie it bases it's score on the amount of copied data, scaling it by the 
bigger of the two src/dst sizes).

So just fixing the "literal added" count should not mess anything up, and 
seems to fix dirstat.

It also looks a bit like diffcore-break actually worked around this whole 
thing, and does

        /* sanity */
        if (src->size < src_copied)
                src_copied = src->size;
        if (dst->size < literal_added + src_copied) {
                if (src_copied < dst->size)
                        literal_added = dst->size - src_copied;   
                else
                        literal_added = 0;
        } 
        src_removed = src->size - src_copied;

so this _may_ change what -B does, but I get the feeling that it should 
improve that too. I'm running a before-and-after "git log -M -B --summary" 
on the kernel now, but it's a pretty expensive operation, so it hasn't 
finished yet.

			Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]