Re: Bug: impure renames may be reported as pure renames

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 21 Feb 2024 09:11:12 -0800

Elijah Newren <newren@xxxxxxxxx> writes:

> Heh, good point.  And more generally, due to how the similarity checks
> work (split the file into lines/chunks, hash each to an integer, then
> sort the list of integers and compare the list of integers between two
> files), whenever you keep all the original lines of a file but permute
> their order, you will see a 100% match.

Stepping back a bit, there internally is a distinction between
"identical files" and "similar but not identical files".  The
file with permuted lines are not identical but are very similar.

The decision not to surface these two to end-users was made long
ago, and after finding identical renames, we skip similarity
computation when minimum_score is set to -M100 from the command
line, which is a documented UI that cannot change.

So you are right to say that similarity index of an inexact rename
should be capped at 99 and never reach 100.  I do not know offhand
if your "how about" computation ...

>  static int similarity_index(struct diff_filepair *p)
>  {
> -       return p->score * 100 / MAX_SCORE;
> +       return p->score * 100 / (MAX_SCORE+1);
>  }

... is the way to go, though.  We have filepair, so shouldn't we be
doing more like

	if (identical_rename(p))
		return 100;
	else
		return 99 * p->score / MAX_SCORE;

instead?

It will also keep working when a minor pet peeve of mine is fixed.
If you have a file with 1000 lines, all hashing to distinct values,
and if you change only one line, even though the similarity is
computed as 99% by the current logic, it actually is 99.9% that is
much closer to 100%.  But we always round down, which feels wrong.

Alternatively, we could cap p->score to (MAX_SCORE-1) when
similarity is _computed_ for filepair between non-identical blobs.
That should happen in diffcore-rename.c:estimate_similarity().
It would allow us to still _show_ "1 line changed among 1000 lines"
case as R100 and still reject inexact renames via "-M100" from the
command line, I think, as the "exact renames only" short-cut works
with the finer-grained "score" values and not the coarser "percent"
values, which might give us a better approach.