On Mon, May 17, 2021 at 6:41 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: > > On 5/3/21 10:12 PM, Elijah Newren via GitGitGadget wrote: > > From: Elijah Newren <newren@xxxxxxxxx> > > > > When there are many renames between the old base of a series of commits > > and the new base for a series of commits, the sequence of merges > > employed to transplant those commits (from a cherry-pick or rebase > > operation) will repeatedly detect the exact same renames. This is > > wasted effort. > > > > Add data structures which will be used to cache rename detection > > results, along with the initialization and deallocation of these data > > structures. Future commits will populate these caches, detect the > > appropriate circumstances when they can be used, and employ them to > > avoid re-detecting the same renames repeatedly. > > I appreciate the definitions and boilerplate for these data > structures being isolated to their own patch. > > > @@ -140,6 +140,37 @@ struct rename_info { > > int callback_data_nr, callback_data_alloc; > > char *callback_data_traverse_path; > > > > + /* > > + * cached_pairs: Caching of renames and deletions. > > + * > > + * These are mappings recording renames and deletions of individual > > + * files (not directories). They are thus a map from an old > > + * filename to either NULL (for deletions) or a new filename (for > > + * renames). > > + */ > > + struct strmap cached_pairs[3]; > > + > > + /* > > + * cached_target_names: just the destinations from cached_pairs > > + * > > + * We sometimes want a fast lookup to determine if a given filename > > + * is one of the destinations in cached_pairs. cached_target_names > > + * is thus duplicative information, but it provides a fast lookup. > > + */ > > + struct strset cached_target_names[3]; > > These two work well together. Very clear. > > > + /* > > + * cached_irrelevant: Caching of rename_sources that aren't relevant. > > + * > > + * cached_pairs records both renames and deletes. Sometimes we > > + * do not know if a path is a rename or a delete because we pass > > + * RELEVANT_LOCATION to diffcore_rename_extended() and based on > > + * various optimizations it returns without detecting whether that > > + * path is actually a rename or a delete. We need to cache such > > + * paths too, but separately from cached_pairs. > > + */ > > + struct strset cached_irrelevant[3]; > > I'm having a hard time parsing what these "irrelevant" paths will be. > It seems like diffcore_rename_extended() will report something other > than "rename" or "delete" for some paths. Could we explicitly mark > that state as "irrelevant"? The state is better known as RELEVANT_NO_MORE, yes. > /* > * cached_irrelevant: Caching of rename_sources that aren't relevant. > * > * cached_pairs records both renames and deletes. Sometimes we > * do not know if a path is a rename or a delete because we pass > * RELEVANT_LOCATION to diffcore_rename_extended() which might > * describe a path as "irrelevant" instead of as a "rename" or "delete". > * We need to cache such paths too, but separately from cached_pairs. > */ > > Does this make sense? diffcore_rename_extended() might need an update > to match this extra, explicit state. Hmm, let's flesh out the description a bit and try to be more explicit. How about: /* * cached_irrelevant: Caching of rename_sources that aren't relevant. * * If we try to detect a rename for a source path and succeed, it's * part of a rename. If we try to detect a rename for a source path * and fail, then it's a delete. If we do not try to detect a rename * for a path, then we don't know if it's a rename or a delete. If * merge-ort doesn't think the path is relevant, then we just won't * cache anything for that path. But there's a slight problem in * that merge-ort can think a path is RELEVANT_LOCATION, but due to * commit 9bd342137e ("diffcore-rename: determine which * relevant_sources are no longer relevant", 2021-03-13), * diffcore-rename can downgrade the path to RELEVANT_NO_MORE. To * avoid excessive calls to diffcore_rename_extended() we still need * to cache such paths, though we cannot record them as either * renames or deletes. So we cache them here as a "turned out to be * irrelevant *for this commit*" as they are often also irrelevant * for subsequent commits, though we will have to do some extra * checking to see whether such paths become relevant for rename * detection when cherry-picking/rebasing subsequent commits. */