Re: Fix up diffcore-rename scoring

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> Now, that said, they _both_ find some pretty funky renames. I think there 
> is probably some serious room for improvement, regardless (or at least 
> changing the default similarity cut-off to something better ;)

I'm afraid that _good_ rename detection really ends up wanting to take 
"longest possible sequence" into account, exactly like the full xdelta 
does. 

Instead of doing a fixed-chunk thing and saying that any copy is 
equivalent to any other copy. That's simply not true. It's _much_ better 
to have one 24-byte copy than it is to have three 8-byte copies, but the 
new faster diffcore-delta.c just can't see that.

So one big reason as to why it is fast in the first place is that it 
fundamentally just doesn't do a very good job ;(

It might be that the fast delta thing is a good way to ask "is this even 
worth considering", to cut down the O(m*n) rename/copy detection to 
something much smaller, and then use xdelta() to actually figure out what 
is a good rename and what isn't from a much smaller set of potential 
targets.

That would actually allow us to be even _less_ precise. Screw that big 
hash-table etc, don't even try to be exact. Just try to be fairly fast, 
and then pick the top entries from the similarity array for more precise 
diffing if there are multiple choices that look like they might be 
possible.

The appended alternate "diffcore-delta.c" doesn't do any of the caching 
(ie I wrote it so that it would be easy to change to make the _caller_ 
keeps "src" constant, and iterates over destination - or the other way 
around - and would do the hash setup just once per src).

Still, even with the existing setup, it's pretty fast for me (not much 
slower than your caching version even though it recalculates everything 
every time). And it's not that far off, which tells me that if it was used 
as a "first-pass filter", we could afford to do a better job on the things 
that it says are likely candidates.

Hmm? It really does bother me how the suggested rename detector finds 
stuff that clearly isn't. 

			Linus

----
#include "cache.h"
#include "diff.h"
#include "diffcore.h"

#define CHUNK (16)
#define SILLYSIZE (65537)
static int hashnr[SILLYSIZE];

static void setup_hash(void)
{
	memset(hashnr, 0, sizeof(hashnr));
}

static void insert_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	hashnr[hashval]++;
}

static int find_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	if (hashnr[hashval]) {
		hashnr[hashval]--;
		return 1;
	}
	return 0;
}

int diffcore_count_changes(void *src, unsigned long src_size,
			   void *dst, unsigned long dst_size,
			   void **src_count_p,
			   void **dst_count_p,
			   unsigned long delta_limit,
			   unsigned long *src_copied,
			   unsigned long *literal_added)
{
	unsigned long copied = 0;
	unsigned long literal = 0;

	setup_hash();
	while (src_size >= CHUNK) {
		unsigned int hashval = adler32(0, src, CHUNK);
		insert_hash(hashval);
		src += CHUNK;
		src_size -= CHUNK;
	}

	while (dst_size >= CHUNK) {
		unsigned int hashval = adler32(0, dst, CHUNK);
		if (find_hash(hashval)) {
			copied += CHUNK;
			dst += CHUNK;
			dst_size -= CHUNK;
			continue;
		}
		literal++;
		if (literal > delta_limit)
			return -1;
		dst++;
		dst_size--;
	}
	literal += dst_size;

	*src_copied = copied;
	*literal_added = literal;
	return 0;
}
-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]