On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@xxxxxxxxx> > > We want to make use of unique basenames to help inform rename detection, > so that more likely pairings can be checked first. (src/moduleA/foo.txt > and source/module/A/foo.txt are likely related if there are no other > 'foo.txt' files among the deleted and added files.) Add a new function, > not yet used, which creates a map of the unique basenames within > rename_src and another within rename_dst, together with the indices > within rename_src/rename_dst where those basenames show up. Non-unique > basenames still show up in the map, but have an invalid index (-1). > > This function was inspired by the fact that in real world repositories, > most renames often do not involve a basename change. Here are some > sample repositories and the percentage of their historical renames (as of > early 2020) that did not involve a basename change: I found this difficult to parse. Perhaps instead "the percentage of their renames that preserved basenames". We might also need something stronger, though: which percentage of renames preserved the basename but also had no other copy of that basename in the scope of all add/deletes? Is this reproducible from a shell command that could be documented here? > +MAYBE_UNUSED > +static int find_basename_matches(struct diff_options *options, > + int minimum_score, > + int num_src) > +{ > + int i; > + struct strintmap sources; > + struct strintmap dests; > + > + /* Create maps of basename -> fullname(s) for sources and dests */ > + strintmap_init_with_options(&sources, -1, NULL, 0); > + strintmap_init_with_options(&dests, -1, NULL, 0); Initially, I was wondering why we need the map for each side, but we will need to enforce uniqueness in each set, so OK. > + for (i = 0; i < num_src; ++i) { > + char *filename = rename_src[i].p->one->path; > + char *base; > + > + /* exact renames removed in remove_unneeded_paths_from_src() */ > + assert(!rename_src[i].p->one->rename_used); > + > + base = strrchr(filename, '/'); > + base = (base ? base+1 : filename); nit: "base + 1" Also, this is used here and below. Perhaps it's worth pulling out as a helper? I see similar code being duplicated in these existing spots: * diff-no-index.c:append_basename() * help.c:append_similar_ref() * packfile.c:pack_basename() * replace-object.c:register_replace_ref() * setup.c:read_gitfile_gently() * builtin/rebase.c:cmd_rebase() * builtin/stash.c:do_create_stash() * builtin/worktree.c:add_worktree() * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage() * contrib/credential/libsecret/git-credential-libsecret.c:usage() * trace2/tr2_dst.c:tr2_dst_try_auto_path() There are other places that use strchr(_, '/') but they seem to be related to peeling basenames off of paths and using the leading portion of the path. > + /* Record index within rename_src (i) if basename is unique */ > + if (strintmap_contains(&sources, base)) > + strintmap_set(&sources, base, -1); > + else > + strintmap_set(&sources, base, i); > + } > + for (i = 0; i < rename_dst_nr; ++i) { > + char *filename = rename_dst[i].p->two->path; > + char *base; > + > + if (rename_dst[i].is_rename) > + continue; /* involved in exact match already. */ > + > + base = strrchr(filename, '/'); > + base = (base ? base+1 : filename); > + > + /* Record index within rename_dst (i) if basename is unique */ > + if (strintmap_contains(&dests, base)) > + strintmap_set(&dests, base, -1); > + else > + strintmap_set(&dests, base, i); > + } > + > + /* TODO: Make use of basenames source and destination basenames */ > + > + strintmap_clear(&sources); > + strintmap_clear(&dests); > + > + return 0; > +} Thanks, -Stolee