Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames

Derrick Stolee <stolee@xxxxxxxxx> · Sun, 7 Feb 2021 09:38:37 -0500

On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@xxxxxxxxx>
> 
> Make use of the new find_basename_matches() function added in the last
> two patches, to find renames more rapidly in cases where we can match up
> files based on basenames.

This is a valuable heuristic.

> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28),
> this change improves the performance as follows:
> 
>                             Before                  After
>     no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
>     mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
>     just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s

These numbers are very impressive.

Before I get too deep into reviewing these patches, I do want
to make it clear that the speed-up is coming at the cost of
a behavior change. We are restricting the "best match" search
to be first among files with common base name (although maybe
I would use 'suffix'?). If we search for a rename among all
additions and deletions ending the ".txt" we might find a
similarity match that is 60% and declare that a rename, even
if there is a ".txt" -> ".md" pair that has a 70% match.

This could be documented in a test case, to demonstrate that
we are making this choice explicitly.

For example, here is a test that passes now, but would
start failing with your patches (if I understand them
correctly):

diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41..e4c71fcf3be 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,21 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'multiple similarity choices' '
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >delete.txt &&
+	git add delete.txt &&
+	git commit -m "base txt" &&
+
+	rm delete.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >add.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >add.md &&
+	git add add.txt add.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -M HEAD HEAD^ >actual &&
+	grep "add.md	delete.txt" actual
+'
+
 test_done

Personally, I'm fine with making this assumption. All of
our renames are based on heuristics, so any opportunity
to reduce the number of content comparisons is a win in
my mind. We also don't report a rename unless there _is_
an add/delete pair that is sufficiently close in content.

So, in this way, we are changing the optimization function
that is used to determine the "best" rename available. It
might be good to update documentation for how we choose
renames:

  An add/delete pair is marked as a rename based on the
  following similarity function:

  1. If the blob content is identical, then those files
     are marked as a rename. (Should we break ties here
     based on the basename?)

  2. Among pairs whose content matches the minimum
     similarity limit, we optimize for:

     i. among files with the same basename (trailer
        after final '.') select pairs with highest
        similarity.

    ii. if no files with the same basename have the
        minimum similarity, then select pairs with
        highest similarity across all filenames.

The above was written quickly as an attempt, so it will
require careful editing to actually make sense to end
users.

Thanks,
-Stolee