This series depends on ort-perf-batch-6[1]. This series uses file basenames in a basic fashion to guide rename detection. Changes since v1: * Significant edits to the commit messages to make them explain the basic idea better and (hopefully) prevent misunderstandings. * Add a commit at the end that updates the documentation and includes a new testcase * Modify the code to make it clearer that it can already handle using a different score for basename comparison similarity, even if we don't use that ability yet (see below) Changes not yet included -- I need input about what is wanted: * Stolee suggested not creating a separate score for basename comparisons, at least not yet. Junio suggested it may be prudent to use a higher score for that than whatever -M option the user provided for normal comparisons...but didn't suggest whether it should be a separate user-specified option, or some kind of weighted average of the -M option and MAX_SCORE (e.g. use 60% if -M is 50%, or use 80% if -M is 75%). I tweaked the code to make it clearer that it already is able to handle such a score difference, but I'm not sure what whether we want an automatically computed higher value or a user-controlled possibly higher value. [1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@xxxxxxxxxxxxxxxxxxxxxx/ [2] https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf Elijah Newren (4): diffcore-rename: compute basenames of all source and dest candidates diffcore-rename: complete find_basename_matches() diffcore-rename: guide inexact rename detection based on basenames gitdiffcore doc: mention new preliminary step for rename detection Documentation/gitdiffcore.txt | 15 +++ diffcore-rename.c | 185 +++++++++++++++++++++++++++++++++- t/t4001-diff-rename.sh | 24 +++++ 3 files changed, 220 insertions(+), 4 deletions(-) base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/843 Range-diff vs v1: 1: 377a4a39fa86 ! 1: 381a45d239bb diffcore-rename: compute basenames of all source and dest candidates @@ Commit message diffcore-rename: compute basenames of all source and dest candidates We want to make use of unique basenames to help inform rename detection, - so that more likely pairings can be checked first. Add a new function, + so that more likely pairings can be checked first. (src/moduleA/foo.txt + and source/module/A/foo.txt are likely related if there are no other + 'foo.txt' files among the deleted and added files.) Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up. Non-unique basenames still show up in the map, but have an invalid index (-1). + This function was inspired by the fact that in real world repositories, + most renames often do not involve a basename change. Here are some + sample repositories and the percentage of their historical renames (as of + early 2020) that did not involve a basename change: + * linux: 76% + * gcc: 64% + * gecko: 79% + * webkit: 89% + Signed-off-by: Elijah Newren <newren@xxxxxxxxx> ## diffcore-rename.c ## 2: 5fb4493247ff ! 2: dcd0175229aa diffcore-rename: complete find_basename_matches() @@ Commit message sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison. + This means we are adding a set of preliminary additional comparisons, + but for each file we only compare it with at most one other file. For + example, if there was a include/media/device.h that was deleted and a + src/module/media/device.h that was added, and there were no other + device.h files added or deleted between the commits being compared, + then these two files would be compared in the preliminary step. + + This commit does not yet actually employ this new optimization, it + merely adds a function which can be used for this purpose. The next + commit will do the necessary plumbing to make use of it. + Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename. I think - that is okay for four reasons: (1) That seems somewhat unlikely in - practice, (2) it's easy to explain to the users what happened if it does - ever occur (or even for them to intuitively figure out), and (3) as the - next patch will show it provides such a large performance boost that - it's worth the tradeoff. Reason (4) takes a full paragraph to + that is okay for four reasons: (1) it's easy to explain to the users + what happened if it does ever occur (or even for them to intuitively + figure out), (2) as the next patch will show it provides such a large + performance boost that it's worth the tradeoff, and (3) it's somewhat + unlikely that despite having unique matching basenames that other files + serve as better matches. Reason (4) takes a full paragraph to explain... If the previous three reasons aren't enough, consider what rename @@ Commit message basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates. - We do not use this heuristic together with either break or copy - detection. The point of break detection is to say that filename - similarity does not imply file content similarity, and we only want to - know about file content similarity. The point of copy detection is to - use more resources to check for additional similarities, while this is - an optimization that uses far less resources but which might also result - in finding slightly fewer similarities. So the idea behind this - optimization goes against both of those features, and will be turned off - for both. - - Note that this optimization is not yet effective in any situation, as - the function is still unused. The next commit will hook it into the - code so that it is used when rename detection is wanted, but neither - copy nor break detection are. - Signed-off-by: Elijah Newren <newren@xxxxxxxxx> ## diffcore-rename.c ## 3: 1d941c35076e ! 3: ce2173aa1fb7 diffcore-rename: guide inexact rename detection based on basenames @@ Commit message Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up - files based on basenames. + files based on basenames. As a quick reminder (see the last two commit + messages for more details), this means for example that + docs/extensions.txt and docs/config/extensions.txt are considered likely + renames if there are no 'extensions.txt' files elsewhere among the added + and deleted files, and if a similarity check confirms they are similar, + then they are marked as a rename without looking for a better similarity + match among other files. This is a behavioral change, as covered in + more detail in the previous commit message. + + We do not use this heuristic together with either break or copy + detection. The point of break detection is to say that filename + similarity does not imply file content similarity, and we only want to + know about file content similarity. The point of copy detection is to + use more resources to check for additional similarities, while this is + an optimization that uses far less resources but which might also result + in finding slightly fewer similarities. So the idea behind this + optimization goes against both of those features, and will be turned off + for both. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull after exact", options->repo); + } else { ++ /* Determine minimum score to match basenames */ ++ int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5; ++ + /* + * Cull sources: + * - remove ones involved in renames (found via exact match) @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + + /* Utilize file basenames to quickly find renames. */ + trace2_region_enter("diff", "basename matches", options->repo); -+ rename_count += find_basename_matches(options, minimum_score, ++ rename_count += find_basename_matches(options, ++ min_basename_score, + rename_src_nr); + trace2_region_leave("diff", "basename matches", options->repo); + -: ------------ > 4: a0e75d8cd6bd gitdiffcore doc: mention new preliminary step for rename detection -- gitgitgadget