[cc'd Junio for comments on this rename optimization]

On Thu, May 01, 2008 at 11:39:40PM +0300, Teemu Likonen wrote:

> > Hmm, looking at the code, though, 50% is supposed to be the default
> > minimum. So there might actually be a bug.
>
> I did some testing... A file, containing 10 lines (about 200 bytes),
> renamed and then modified (similarity index being a bit over 50%). Git

Ah, OK. The problem comes because the toy example is so tiny. It hits
this code chunk:

  if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
          return 0;

where base_size is the size of the smaller file in bytes, and
delta_size is the difference between the sizes of the two files. This
is an optimization so that we don't even have to look at the contents.

But it bases the percentage on the smaller file, so even though file B
("hello\nworld\n") is 50% made up of file A ("hello\n"), we actually
end up saying "there must be at least as much content added to make B
as there is in A already". IOW, the "percentage similarity" is based
on the smaller file for this optimization.

Obviously this is a toy case, but I wonder if there are other larger
cases where you end up with a file which has substantial copied
content, but also _grows_ a lot (not just changes). For example,
consider the file:

  1
  2
  3
  4
  5
  6
  7
  8
  9

that is, nine lines, each with a number. Now rename it, and start
adding more numbers. We detect the addition of 10, 11, and 12, but
adding 13 means we no longer match. So even with only 4 lines added,
we fail to detect the rename (the P.S. below works through the
numbers).

But again, this is a bit of a toy case. It relies on the line length
being a significant factor compared to the number of lines.

-Peff
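
P.S. For anyone who wants to poke at the arithmetic, here is a tiny
standalone sketch of just that early-exit check. MAX_SCORE and the 50%
default mirror the values in diffcore.h (where the latter is called
DEFAULT_RENAME_SCORE), but too_different() is a made-up helper with
byte counts hard-coded from the toy example above, not the real
diffcore-rename code:

  #include <stdio.h>

  #define MAX_SCORE 60000
  #define MINIMUM_SCORE 30000  /* the 50% default */

  /* Mimics the early-exit check: nonzero means the candidate pair
   * is rejected before we ever look at the file contents. */
  static int too_different(unsigned long base_size,
                           unsigned long delta_size)
  {
          return base_size * (MAX_SCORE - MINIMUM_SCORE)
                  < delta_size * MAX_SCORE;
  }

  int main(void)
  {
          /* "1\n" through "9\n" is 18 bytes; every added line
           * ("10\n", "11\n", ...) is 3 more bytes. */
          unsigned long base_size = 18;
          int added;

          for (added = 1; added <= 4; added++)
                  printf("%d line(s) added: %s\n", added,
                         too_different(base_size, 3 * added)
                         ? "rejected early" : "still considered");
          return 0;
  }

This prints "still considered" up through three added lines (delta_size
of 9 makes both sides exactly 540000) and flips to "rejected early" at
the fourth, matching the behavior described above.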