Re: Bug: impure renames may be reported as pure renames

Elijah Newren <newren@xxxxxxxxx> · Tue, 20 Feb 2024 18:12:00 -0800

On Tue, Feb 20, 2024 at 12:58 PM Mário Guimarães
<mario.luis.guimaraes@xxxxxxxxx> wrote:
>
> Hello Git community,
>
> please see the report below of what may be a bug.
>
> Yours sincerely
> Mário Guimarães
>
> ======================================================
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.
>
> What did you do before the bug happened? (Steps to reproduce your issue)
>
> In the rust-lang/rust repository (I cloned today from GitHub), if we
> run the command
>
>     git diff-tree -r -M a04e649^2 a04e649 --
> tests/ui/issues/issue-83190.rs
> tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>
> we get this result
>
>     :100644 100644 da931c3edaf6f9de6805e771f2b3b28edd27001f
> 11b659eec97323ea5190dac1771c7ca3241861e7 R100
> tests/ui/issues/issue-83190.rs
> tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>
> However, the source and destination files of the rename are not 100%
> equal. If we run this other command
>
>     git diff -M a04e649^2 a04e649 -- tests/ui/issues/issue-83190.rs
> tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>
> we get the following result
>
>     diff --git a/tests/ui/issues/issue-83190.rs
> b/tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>     similarity index 100%
>     rename from tests/ui/issues/issue-83190.rs
>     rename to tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>     index da931c3edaf..11b659eec97 100644
>     --- a/tests/ui/issues/issue-83190.rs
>     +++ b/tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>     @@ -1,7 +1,7 @@
>     -// check-pass
>     -
>      // Regression test for issue #83190, triggering an ICE in borrowck.
>
>     +// check-pass
>     +
>      pub trait Any {}
>      impl<T> Any for T {}

Heh, good point.  And more generally, due to how the similarity checks
work (split the file into lines/chunks, hash each to an integer, then
sort the list of integers and compare the list of integers between two
files), whenever you keep all the original lines of a file but permute
their order, you will see a 100% match.

Maybe a simple hack like the below would suffice to fix (untested --
anyone want to test it out for me?):

diff --git a/diff.c b/diff.c
index a2def45644b..6b7b3a8b9af 100644
--- a/diff.c
+++ b/diff.c
@@ -4405,7 +4405,7 @@ static void run_external_diff(const char *pgm,

 static int similarity_index(struct diff_filepair *p)
 {
-       return p->score * 100 / MAX_SCORE;
+       return p->score * 100 / (MAX_SCORE+1);
 }

 static const char *diff_abbrev_oid(const struct object_id *oid, int abbrev)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 5a6e2bcac71..3228a898e0b 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -314,7 +314,7 @@ static int find_identical_files(struct hashmap *srcs,
                        break;
        }
        if (best) {
-               record_rename_pair(dst_index, best->index, MAX_SCORE);
+               record_rename_pair(dst_index, best->index, MAX_SCORE+1);
                renames++;
        }
        return renames;

Which is based on the premise that 100% of source lines can match 100%
of destination lines and result in a score of MAX_SCORE, so we
manually set the scoring to MAX_SCORE+1 when we detect exact renames
based on matching hashes and then use that higher value as the divisor
when computing percentages.  (Note that there are several other places
where "MAX_SCORE" is used and which I'm pretty sure we would _not_
want to replace with "MAX_SCORE+1", so if you don't like my patch and
want to redefine MAX_SCORE, then all those other locations would need
to be adjusted instead.)