On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@xxxxxxxxx> > > Make use of the new find_basename_matches() function added in the last > two patches, to find renames more rapidly in cases where we can match up > files based on basenames. This is a valuable heuristic. > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin > performance work; instrument with trace2_region_* calls", 2020-10-28), > this change improves the performance as follows: > > Before After > no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s > mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s > just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s These numbers are very impressive. Before I get too deep into reviewing these patches, I do want to make it clear that the speed-up is coming at the cost of a behavior change. We are restricting the "best match" search to be first among files with common base name (although maybe I would use 'suffix'?). If we search for a rename among all additions and deletions ending the ".txt" we might find a similarity match that is 60% and declare that a rename, even if there is a ".txt" -> ".md" pair that has a 70% match. This could be documented in a test case, to demonstrate that we are making this choice explicitly. For example, here is a test that passes now, but would start failing with your patches (if I understand them correctly): diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index c16486a9d41..e4c71fcf3be 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -262,4 +262,21 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' grep "myotherfile.*myfile" actual ' +test_expect_success 'multiple similarity choices' ' + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 line10 >delete.txt && + git add delete.txt && + git commit -m "base txt" && + + rm delete.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 >add.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 >add.md && + git add add.txt add.md && + git commit -a -m "rename" && + git diff-tree -M HEAD HEAD^ >actual && + grep "add.md delete.txt" actual +' + test_done Personally, I'm fine with making this assumption. All of our renames are based on heuristics, so any opportunity to reduce the number of content comparisons is a win in my mind. We also don't report a rename unless there _is_ an add/delete pair that is sufficiently close in content. So, in this way, we are changing the optimization function that is used to determine the "best" rename available. It might be good to update documentation for how we choose renames: An add/delete pair is marked as a rename based on the following similarity function: 1. If the blob content is identical, then those files are marked as a rename. (Should we break ties here based on the basename?) 2. Among pairs whose content matches the minimum similarity limit, we optimize for: i. among files with the same basename (trailer after final '.') select pairs with highest similarity. ii. if no files with the same basename have the minimum similarity, then select pairs with highest similarity across all filenames. The above was written quickly as an attempt, so it will require careful editing to actually make sense to end users. Thanks, -Stolee