On Thu, Jul 15, 2021 at 6:54 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: > > On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote: > > From: Elijah Newren <newren@xxxxxxxxx> > > The first two patches were easy to follow. This one requires a more > careful read, so here is my attempt to double-check that I am > reading it right. Yep, you seem to be understanding all of it just fine (does that mean I did a good job explaining?), and made some good suggestions to improve it further. > > In commit f89b4f2bee ("merge-ort: skip rename detection entirely if > > possible", 2021-03-11), we added an additional reason that rename > > detection could be skipped entirely -- namely, if no *relevant* sources > > were present. Without completing collect_merge_info_callback(), we do > > not yet know if there are no relevant sources. However, we do know that > > if the current directory on one side matches the merge base, then every > > source file within that directory will not be RELEVANT_CONTENT, and a > > few simple checks can often let us rule out RELEVANT_LOCATION as well. > > This suggests we can just defer recursing into such directories until > > the end of collect_merge_info. > > We are not making early decisions, but collecting a list of directories > that cannot be relevant sources. These directories could still be > relevant locations (or content), but we don't want to pay the cost of > recursing right away. > > > Since the deferred directories are known to not add any relevant sources > > due to the above properties, then if there are no relevant sources after > > we've traversed all paths other than the deferred ones, then we know > > there are not any relevant sources. Under those conditions, rename > > detection is unnecessary, and that means we can resolve the deferred > > directories without recursing into them. > > Only in the case where there are no relevant sources at the end of > scanning all directories (possibly non-recursively), then we can skip > recursing into these directories because finding relevant locations > is wasted effort now. > > This means the performance benefit will be limited to cases where > rename detection isn't needed at all, but that is a frequent enough > case to merit the overhead of tracking this information differently. > > > Note that the logic for skipping rename detection was also modified > > further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached > > renames when possible", 2021-01-30); in particular rename detection can > > be skipped if we already have cached renames for each relevant source. > > We can take advantage of this information as well with our deferral of > > recursing into directories where one side matches the merge base. > > > > Add some data structures that we will use to do these deferrals, with > > some lengthy comments explaining their purpose. > > And now that the idea is understood, we are just laying the groundwork > for the process by creating the tracking data structures, not actually > the algorithm to use them. Makes sense to split them up. > > > + /* > > + * possible_trivial_merges: directories we defer recursing into > > Not only are we deferring the recursion, we might be able to avoid > it entirely. Perhaps "directories to be explored only when needed" > would emphasize this point differently? I like the sound of that; seems clearer. Will update. > > + * > > + * possible_trivial_merges is a map of directory names to > > + * dir_rename_mask. When we detect that a directory is unchanged on > > + * one side, we can sometimes resolve the directory without recursing > > + * into it. Renames are the only things that can prevent such an > > + * optimization. However, for rename sources: > > + * - If no parent directory needed directory rename detection, then > > + * no path under such a directory can be a relevant_source. > > + * and for rename destinations: > > + * - If no cached rename has a target path under the directory AND > > + * - If there are no unpaired relevant_sources elsewhere in the > > + * repository > > + * then we don't need any path under this directory for a rename > > + * destination. The only way to know the last item above is to defer > > + * handling such directories until the end of collect_merge_info(), > > + * in handle_deferred_entries(). > > + * > > + * For each we store dir_rename_mask, since that's the only bit of > > + * information we need, other than the path, to resume the recursive > > + * traversal. > > + */ > > + struct strintmap possible_trivial_merges[3]; > > + > > + /* > > + * trivial_merges_okay: if trivial directory merges are okay > > + * > > + * See possible_trivial_merges above. The "no unpaired > > + * relevant_sources elsewhere in the repository" is a single boolean > > + * per merge side, which we store here. Note that while 0 means no, > > + * 1 only means "maybe" rather than "yes"; we optimistically set it > > + * to 1 initially and only clear when we determine it is unsafe to > > + * do trivial directory merges. > > + */ > > + unsigned trivial_merges_okay[3]; > > + > > + /* > > + * target_dirs: ancestor directories of rename targets > > + * > > + * target_dirs contains all directory names that are an ancestor of > > + * any rename destination. > > + */ > > + struct strset target_dirs[3]; > > These three members seem tightly coupled. Would it be beneficial to > group the data into a 'struct trivial_merge_data' that could then > have an array of three here? Ooh, I like that idea; makes sense. That'll naturally flow with your last two comments too. > > /* > > * dir_rename_mask: > > * 0: optimization removing unmodified potential rename source okay > > @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti, > > strintmap_func(&renames->dirs_removed[i]); > > strmap_func(&renames->dir_renames[i], 0); > > strintmap_func(&renames->relevant_sources[i]); > > + strintmap_func(&renames->possible_trivial_merges[i]); > > + strset_func(&renames->target_dirs[i]); > > + renames->trivial_merges_okay[i] = 1; /* 1 == maybe */ > > This could be collected into an initializer method for the data > required by this optimization instead of three initializations > among many. > > > if (!reinitialize) > > assert(renames->cached_pairs_valid_side == 0); > > if (i != renames->cached_pairs_valid_side) { > > @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result) > > strintmap_init_with_options(&renames->relevant_sources[i], > > -1 /* explicitly invalid */, > > NULL, 0); > > + strintmap_init_with_options(&renames->possible_trivial_merges[i], > > + 0, NULL, 0); > > + strset_init_with_options(&renames->target_dirs[i], > > + NULL, 1); > > strmap_init_with_options(&renames->cached_pairs[i], > > NULL, 1); > > strset_init_with_options(&renames->cached_irrelevant[i], > > NULL, 1); > > strset_init_with_options(&renames->cached_target_names[i], > > NULL, 0); > > + renames->trivial_merges_okay[i] = 1; /* 1 == maybe */ > > Here, there is even a separation between the grouped data. If you > don't like the idea of creating helper methods, then at least these > new lines could be grouped together. Bonus points for adding > whitespace between these lines and the others.