[cc'd Junio for comments on this rename optimization]

On Thu, May 01, 2008 at 11:39:40PM +0300, Teemu Likonen wrote:

> > Hmm, looking at the code, though, 50% is supposed to be the default
> > minimum. So there might actually be a bug.
>
> I did some testing... A file, containing 10 lines (about 200 bytes),
> renamed and then modified (similarity index being a bit over 50%). Git

Ah, OK. The problem comes because the toy example is so tiny. It hits
this code chunk:

  if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
          return 0;

where base_size is the size of the smaller file in bytes, and
delta_size is the difference between the sizes of the two files. This
is an optimization so that we don't even have to look at the contents.

But it bases the percentage on the smaller file, so even though file B
("hello\nworld\n") is 50% made up of file A ("hello\n"), we actually
end up saying "there must be at least as much content added to make B
as there is in A already". IOW, the "percentage similarity" is based
on the smaller file for this optimization.

Obviously this is a toy case, but I wonder if there are other larger
cases where you end up with a file which has substantial copied
content, but also _grows_ a lot (not just changes). For example,
consider the file:

  1
  2
  3
  4
  5
  6
  7
  8
  9

that is, nine lines, each with a number. Now rename it, and start
adding more numbers. We detect the addition of 10, 11, and 12, but
adding 13 means we no longer match. So even with only 4 lines added,
we fail to detect the rename (the P.S. below works through the
numbers).

But again, this is a bit of a toy case. It relies on the line length
being a significant factor compared to the number of lines.

-Peff
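
P.S. For anyone who wants to poke at the arithmetic, here is a tiny
standalone sketch of just that early-exit check. MAX_SCORE and the 50%
default mirror the values in diffcore.h (where the latter is called
DEFAULT_RENAME_SCORE), but too_different() is a made-up helper with
byte counts hard-coded from the toy example above, not the real
diffcore-rename code:

  #include <stdio.h>

  #define MAX_SCORE 60000
  #define MINIMUM_SCORE 30000  /* the 50% default */

  /* Mimics the early-exit check: nonzero means the candidate pair
   * is rejected before we ever look at the file contents. */
  static int too_different(unsigned long base_size,
                           unsigned long delta_size)
  {
          return base_size * (MAX_SCORE - MINIMUM_SCORE)
                  < delta_size * MAX_SCORE;
  }

  int main(void)
  {
          /* "1\n" through "9\n" is 18 bytes; every added line
           * ("10\n", "11\n", ...) is 3 more bytes. */
          unsigned long base_size = 18;
          int added;

          for (added = 1; added <= 4; added++)
                  printf("%d line(s) added: %s\n", added,
                         too_different(base_size, 3 * added)
                         ? "rejected early" : "still considered");
          return 0;
  }

This prints "still considered" up through three added lines (delta_size
of 9 makes both sides exactly 540000) and flips to "rejected early" at
the fourth, matching the behavior described above.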