On Sun, 12 Mar 2006, Linus Torvalds wrote: > > Now, that said, they _both_ find some pretty funky renames. I think there > is probably some serious room for improvement, regardless (or at least > changing the default similarity cut-off to something better ;) I'm afraid that _good_ rename detection really ends up wanting to take "longest possible sequence" into account, exactly like the full xdelta does. Instead of doing a fixed-chunk thing and saying that any copy is equivalent to any other copy. That's simply not true. It's _much_ better to have one 24-byte copy than it is to have three 8-byte copies, but the new faster diffcore-delta.c just can't see that. So one big reason as to why it is fast in the first place is that it fundamentally just doesn't do a very good job ;( It might be that the fast delta thing is a good way to ask "is this even worth considering", to cut down the O(m*n) rename/copy detection to something much smaller, and then use xdelta() to actually figure out what is a good rename and what isn't from a much smaller set of potential targets. That would actually allow us to be even _less_ precise. Screw that big hash-table etc, don't even try to be exact. Just try to be fairly fast, and then pick the top entries from the similarity array for more precise diffing if there are multiple choices that look like they might be possible. The appended alternate "diffcore-delta.c" doesn't do any of the caching (ie I wrote it so that it would be easy to change to make the _caller_ keeps "src" constant, and iterates over destination - or the other way around - and would do the hash setup just once per src). Still, even with the existing setup, it's pretty fast for me (not much slower than your caching version even though it recalculates everything every time). And it's not that far off, which tells me that if it was used as a "first-pass filter", we could afford to do a better job on the things that it says are likely candidates. Hmm? It really does bother me how the suggested rename detector finds stuff that clearly isn't. Linus ---- #include "cache.h" #include "diff.h" #include "diffcore.h" #define CHUNK (16) #define SILLYSIZE (65537) static int hashnr[SILLYSIZE]; static void setup_hash(void) { memset(hashnr, 0, sizeof(hashnr)); } static void insert_hash(unsigned int hashval) { hashval = hashval % SILLYSIZE; hashnr[hashval]++; } static int find_hash(unsigned int hashval) { hashval = hashval % SILLYSIZE; if (hashnr[hashval]) { hashnr[hashval]--; return 1; } return 0; } int diffcore_count_changes(void *src, unsigned long src_size, void *dst, unsigned long dst_size, void **src_count_p, void **dst_count_p, unsigned long delta_limit, unsigned long *src_copied, unsigned long *literal_added) { unsigned long copied = 0; unsigned long literal = 0; setup_hash(); while (src_size >= CHUNK) { unsigned int hashval = adler32(0, src, CHUNK); insert_hash(hashval); src += CHUNK; src_size -= CHUNK; } while (dst_size >= CHUNK) { unsigned int hashval = adler32(0, dst, CHUNK); if (find_hash(hashval)) { copied += CHUNK; dst += CHUNK; dst_size -= CHUNK; continue; } literal++; if (literal > delta_limit) return -1; dst++; dst_size--; } literal += dst_size; *src_copied = copied; *literal_added = literal; return 0; } - : send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html