This series depends on ort-perf-batch-6[1]. This series uses file basenames (portion of the path after final '/', including extension) in a basic fashion to guide rename detection. Changes since v2: * insert a new patch at the beginning to test expected old behavior, then have later patches change the test expectation * tweak commit message wording to clarify that rename stats merely motivated the optimization * factor out a simple get_basename() helper; have it document why we don't use gitbasename() * use a higher required threshold to mark same-basename files as a rename, defaulting to average of minimum_score and MAX_SCORE (since default rename threshold is 50%, this implies default basename threshold is 75%) * provide an environment variable (undocumented) that can be used to test appropriate threshold for basename-sameness * updated the timings based on the new threshold * modify the documentation with Junio's suggested simpler and clearer wording * clean up the wording on a few comments [1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@xxxxxxxxxxxxxxxxxxxxxx/ [2] https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf Elijah Newren (5): t4001: add a test comparing basename similarity and content similarity diffcore-rename: compute basenames of all source and dest candidates diffcore-rename: complete find_basename_matches() diffcore-rename: guide inexact rename detection based on basenames gitdiffcore doc: mention new preliminary step for rename detection Documentation/gitdiffcore.txt | 17 +++ diffcore-rename.c | 202 +++++++++++++++++++++++++++++++++- t/t4001-diff-rename.sh | 24 ++++ 3 files changed, 239 insertions(+), 4 deletions(-) base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/843 Range-diff vs v2: 4: a0e75d8cd6bd ! 1: 3e6af929d135 gitdiffcore doc: mention new preliminary step for rename detection @@ Metadata Author: Elijah Newren <newren@xxxxxxxxx> ## Commit message ## - gitdiffcore doc: mention new preliminary step for rename detection + t4001: add a test comparing basename similarity and content similarity - The last few patches have introduced a new preliminary step when rename - detection is on but both break detection and copy detection are off. - Document this new step. While we're at it, add a testcase that checks - the new behavior as well. + Add a simple test where a removed file is similar to two different added + files; one of them has the same basename, and the other has a slightly + higher content similarity. Without break detection, filename similarity + of 100% trumps content similarity for pairing up related files. For + any filename similarity less than 100%, the opposite is true -- content + similarity is all that matters. Add a testcase that documents this. - Signed-off-by: Elijah Newren <newren@xxxxxxxxx> + Subsequent commits will add a new rule that includes an inbetween state, + where a mixture of filename similarity and content similarity are + weighed, and which will change the outcome of this testcase. - ## Documentation/gitdiffcore.txt ## -@@ Documentation/gitdiffcore.txt: a similarity score different from the default of 50% by giving a - number after the "-M" or "-C" option (e.g. "-M8" to tell it to use - 8/10 = 80%). - -+Note that when rename detection is on but both copy and break -+detection are off, rename detection adds a preliminary step that first -+checks files with the same basename. If files with the same basename -+are sufficiently similar, it will mark them as renames and exclude -+them from the later quadratic step (the one that pairwise compares all -+unmatched files to find the "best" matches, determined by the highest -+content similarity). So, for example, if docs/extensions.txt and -+docs/config/extensions.txt have similar content, then they will be -+marked as a rename even if it turns out that docs/extensions.txt was -+more similar to src/extension-checks.c. At most, one comparison is -+done per file in this preliminary pass; so if there are several -+extensions.txt files throughout the directory hierarchy that were -+added and deleted, this preliminary step will be skipped for those -+files. -+ - Note. When the "-C" option is used with `--find-copies-harder` - option, 'git diff-{asterisk}' commands feed unmodified filepairs to - diffcore mechanism as well as modified ones. This lets the copy + Signed-off-by: Elijah Newren <newren@xxxxxxxxx> ## t/t4001-diff-rename.sh ## @@ t/t4001-diff-rename.sh: test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' @@ t/t4001-diff-rename.sh: test_expect_success 'diff-tree -l0 defaults to a big ren + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, + # but since same basenames are checked first... + cat >expected <<-\EOF && -+ A file.md -+ R078 subdir/file.txt file.txt ++ R088 subdir/file.txt file.md ++ A file.txt + EOF + test_cmp expected actual +' 1: 381a45d239bb ! 2: 4fff9b1ff57b diffcore-rename: compute basenames of all source and dest candidates @@ Commit message basenames still show up in the map, but have an invalid index (-1). This function was inspired by the fact that in real world repositories, - most renames often do not involve a basename change. Here are some - sample repositories and the percentage of their historical renames (as of - early 2020) that did not involve a basename change: + files are often moved across directories without changing names. Here + are some sample repositories and the percentage of their historical + renames (as of early 2020) that preserved basenames: * linux: 76% * gcc: 64% * gecko: 79% * webkit: 89% + These statistics alone don't prove that an optimization in this area + will help or how much it will help, since there are also unpaired adds + and deletes, restrictions on which basenames we consider, etc., but it + certainly motivated the idea to try something in this area. Signed-off-by: Elijah Newren <newren@xxxxxxxxx> @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options) return renames; } ++static const char *get_basename(const char *filename) ++{ ++ /* ++ * gitbasename() has to worry about special drivers, multiple ++ * directory separator characters, trailing slashes, NULL or ++ * empty strings, etc. We only work on filenames as stored in ++ * git, and thus get to ignore all those complications. ++ */ ++ const char *base = strrchr(filename, '/'); ++ return base ? base + 1 : filename; ++} ++ +MAYBE_UNUSED +static int find_basename_matches(struct diff_options *options, + int minimum_score, @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options) + strintmap_init_with_options(&dests, -1, NULL, 0); + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; -+ char *base; ++ const char *base; + + /* exact renames removed in remove_unneeded_paths_from_src() */ + assert(!rename_src[i].p->one->rename_used); + -+ base = strrchr(filename, '/'); -+ base = (base ? base+1 : filename); -+ + /* Record index within rename_src (i) if basename is unique */ ++ base = get_basename(filename); + if (strintmap_contains(&sources, base)) + strintmap_set(&sources, base, -1); + else @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options) + } + for (i = 0; i < rename_dst_nr; ++i) { + char *filename = rename_dst[i].p->two->path; -+ char *base; ++ const char *base; + + if (rename_dst[i].is_rename) + continue; /* involved in exact match already. */ + -+ base = strrchr(filename, '/'); -+ base = (base ? base+1 : filename); -+ + /* Record index within rename_dst (i) if basename is unique */ ++ base = get_basename(filename); + if (strintmap_contains(&dests, base)) + strintmap_set(&dests, base, -1); + else 2: dcd0175229aa ! 3: dc26881e4ed3 diffcore-rename: complete find_basename_matches() @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options { - int i; + /* -+ * When I checked, over 76% of file renames in linux just moved -+ * files to a different directory but kept the same basename. gcc -+ * did that with over 64% of renames, gecko did it with over 79%, -+ * and WebKit did it with over 89%. ++ * When I checked in early 2020, over 76% of file renames in linux ++ * just moved files to a different directory but kept the same ++ * basename. gcc did that with over 64% of renames, gecko did it ++ * with over 79%, and WebKit did it with over 89%. + * + * Therefore we can bypass the normal exhaustive NxM matrix + * comparison of similarities between all potential rename sources -+ * and destinations by instead using file basename as a hint, checking -+ * for similarity between files with the same basename, and if we -+ * find a pair that are sufficiently similar, record the rename -+ * pair and exclude those two from the NxM matrix. ++ * and destinations by instead using file basename as a hint (i.e. ++ * the portion of the filename after the last '/'), checking for ++ * similarity between files with the same basename, and if we find ++ * a pair that are sufficiently similar, record the rename pair and ++ * exclude those two from the NxM matrix. + * + * This *might* cause us to find a less than optimal pairing (if + * there is another file that we are even more similar to but has a @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options + * basename matching provides, and given the frequency with which + * people use the same basename in real world projects, that's a + * trade-off we are willing to accept when doing just rename -+ * detection. However, if someone wants copy detection that -+ * implies they are willing to spend more cycles to find -+ * similarities between files, so it may be less likely that this -+ * heuristic is wanted. ++ * detection. ++ * ++ * If someone wants copy detection that implies they are willing to ++ * spend more cycles to find similarities between files, so it may ++ * be less likely that this heuristic is wanted. If someone is ++ * doing break detection, that means they do not want filename ++ * similarity to imply any form of content similiarity, and thus ++ * this heuristic would definitely be incompatible. + */ + + int i, renames = 0; @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options struct strintmap dests; + /* -+ * The prefeteching stuff wants to know if it can skip prefetching blobs -+ * that are unmodified. unmodified blobs are only relevant when doing -+ * copy detection. find_basename_matches() is only used when detecting -+ * renames, not when detecting copies, so it'll only be used when a file -+ * only existed in the source. Since we already know that the file -+ * won't be unmodified, there's no point checking for it; that's just a -+ * waste of resources. So set skip_unmodified to 0 so that -+ * estimate_similarity() and prefetch() won't waste resources checking -+ * for something we already know is false. ++ * The prefeteching stuff wants to know if it can skip prefetching ++ * blobs that are unmodified...and will then do a little extra work ++ * to verify that the oids are indeed different before prefetching. ++ * Unmodified blobs are only relevant when doing copy detection; ++ * when limiting to rename detection, diffcore_rename[_extended]() ++ * will never be called with unmodified source paths fed to us, so ++ * the extra work necessary to check if rename_src entries are ++ * unmodified would be a small waste. + */ + int skip_unmodified = 0; + @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options + /* Now look for basename matchups and do similarity estimation */ + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; -+ char *base = NULL; ++ const char *base = NULL; + intptr_t src_index; + intptr_t dst_index; + -+ /* Get the basename */ -+ base = strrchr(filename, '/'); -+ base = (base ? base+1 : filename); -+ + /* Find out if this basename is unique among sources */ ++ base = get_basename(filename); + src_index = strintmap_get(&sources, base); + if (src_index == -1) + continue; /* not a unique basename; skip it */ 3: ce2173aa1fb7 ! 4: 2493f4b2f55d diffcore-rename: guide inexact rename detection based on basenames @@ Commit message this change improves the performance as follows: Before After - no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s - mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s - just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s + no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s + mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s + just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s Signed-off-by: Elijah Newren <newren@xxxxxxxxx> ## diffcore-rename.c ## -@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options) - return renames; +@@ diffcore-rename.c: static const char *get_basename(const char *filename) + return base ? base + 1 : filename; } -MAYBE_UNUSED @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + trace2_region_leave("diff", "cull after exact", options->repo); + } else { + /* Determine minimum score to match basenames */ -+ int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5; ++ double factor = 0.5; ++ char *basename_factor = getenv("GIT_BASENAME_FACTOR"); ++ int min_basename_score; ++ ++ if (basename_factor) ++ factor = strtol(basename_factor, NULL, 10)/100.0; ++ assert(factor >= 0.0 && factor <= 1.0); ++ min_basename_score = minimum_score + ++ (int)(factor * (MAX_SCORE - minimum_score)); + + /* + * Cull sources: + * - remove ones involved in renames (found via exact match) + */ -+ trace2_region_enter("diff", "cull exact", options->repo); ++ trace2_region_enter("diff", "cull after exact", options->repo); + remove_unneeded_paths_from_src(want_copies); -+ trace2_region_leave("diff", "cull exact", options->repo); ++ trace2_region_leave("diff", "cull after exact", options->repo); + + /* Utilize file basenames to quickly find renames. */ + trace2_region_enter("diff", "basename matches", options->repo); @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST]; for (j = 0; j < NUM_CANDIDATE_PER_DST; j++) + + ## t/t4001-diff-rename.sh ## +@@ t/t4001-diff-rename.sh: test_expect_success 'basename similarity vs best similarity' ' + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, + # but since same basenames are checked first... + cat >expected <<-\EOF && +- R088 subdir/file.txt file.md +- A file.txt ++ A file.md ++ R078 subdir/file.txt file.txt + EOF + test_cmp expected actual + ' -: ------------ > 5: fc72d24a3358 gitdiffcore doc: mention new preliminary step for rename detection -- gitgitgadget