Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 15 Jul 2021 09:54:26 -0400

On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@xxxxxxxxx>

The first two patches were easy to follow. This one requires a more
careful read, so here is my attempt to double-check that I am
reading it right.

> In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> possible", 2021-03-11), we added an additional reason that rename
> detection could be skipped entirely -- namely, if no *relevant* sources
> were present.  Without completing collect_merge_info_callback(), we do
> not yet know if there are no relevant sources.  However, we do know that
> if the current directory on one side matches the merge base, then every
> source file within that directory will not be RELEVANT_CONTENT, and a
> few simple checks can often let us rule out RELEVANT_LOCATION as well.
> This suggests we can just defer recursing into such directories until
> the end of collect_merge_info.

We are not making early decisions, but collecting a list of directories
that cannot be relevant sources. These directories could still be
relevant locations (or content), but we don't want to pay the cost of
recursing right away.

> Since the deferred directories are known to not add any relevant sources
> due to the above properties, then if there are no relevant sources after
> we've traversed all paths other than the deferred ones, then we know
> there are not any relevant sources.  Under those conditions, rename
> detection is unnecessary, and that means we can resolve the deferred
> directories without recursing into them.

Only in the case where there are no relevant sources at the end of
scanning all directories (possibly non-recursively), then we can skip
recursing into these directories because finding relevant locations
is wasted effort now.

This means the performance benefit will be limited to cases where
rename detection isn't needed at all, but that is a frequent enough
case to merit the overhead of tracking this information differently.

> Note that the logic for skipping rename detection was also modified
> further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> renames when possible", 2021-01-30); in particular rename detection can
> be skipped if we already have cached renames for each relevant source.
> We can take advantage of this information as well with our deferral of
> recursing into directories where one side matches the merge base.
> 
> Add some data structures that we will use to do these deferrals, with
> some lengthy comments explaining their purpose.

And now that the idea is understood, we are just laying the groundwork
for the process by creating the tracking data structures, not actually
the algorithm to use them. Makes sense to split them up.

> +	/*
> +	 * possible_trivial_merges: directories we defer recursing into

Not only are we deferring the recursion, we might be able to avoid
it entirely. Perhaps "directories to be explored only when needed"
would emphasize this point differently?

> +	 *
> +	 * possible_trivial_merges is a map of directory names to
> +	 * dir_rename_mask.  When we detect that a directory is unchanged on
> +	 * one side, we can sometimes resolve the directory without recursing
> +	 * into it.  Renames are the only things that can prevent such an
> +	 * optimization.  However, for rename sources:
> +	 *   - If no parent directory needed directory rename detection, then
> +	 *     no path under such a directory can be a relevant_source.
> +	 * and for rename destinations:
> +	 *   - If no cached rename has a target path under the directory AND
> +	 *   - If there are no unpaired relevant_sources elsewhere in the
> +	 *     repository
> +	 * then we don't need any path under this directory for a rename
> +	 * destination.  The only way to know the last item above is to defer
> +	 * handling such directories until the end of collect_merge_info(),
> +	 * in handle_deferred_entries().
> +	 *
> +	 * For each we store dir_rename_mask, since that's the only bit of
> +	 * information we need, other than the path, to resume the recursive
> +	 * traversal.
> +	 */
> +	struct strintmap possible_trivial_merges[3];
> +
> +	/*
> +	 * trivial_merges_okay: if trivial directory merges are okay
> +	 *
> +	 * See possible_trivial_merges above.  The "no unpaired
> +	 * relevant_sources elsewhere in the repository" is a single boolean
> +	 * per merge side, which we store here.  Note that while 0 means no,
> +	 * 1 only means "maybe" rather than "yes"; we optimistically set it
> +	 * to 1 initially and only clear when we determine it is unsafe to
> +	 * do trivial directory merges.
> +	 */
> +	unsigned trivial_merges_okay[3];
> +
> +	/*
> +	 * target_dirs: ancestor directories of rename targets
> +	 *
> +	 * target_dirs contains all directory names that are an ancestor of
> +	 * any rename destination.
> +	 */
> +	struct strset target_dirs[3];

These three members seem tightly coupled. Would it be beneficial to
group the data into a 'struct trivial_merge_data' that could then
have an array of three here?

>  	/*
>  	 * dir_rename_mask:
>  	 *   0: optimization removing unmodified potential rename source okay
> @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
>  		strintmap_func(&renames->dirs_removed[i]);
>  		strmap_func(&renames->dir_renames[i], 0);
>  		strintmap_func(&renames->relevant_sources[i]);
> +		strintmap_func(&renames->possible_trivial_merges[i]);
> +		strset_func(&renames->target_dirs[i]);
> +		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */

This could be collected into an initializer method for the data
required by this optimization instead of three initializations
among many.

>  		if (!reinitialize)
>  			assert(renames->cached_pairs_valid_side == 0);
>  		if (i != renames->cached_pairs_valid_side) {
> @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
>  		strintmap_init_with_options(&renames->relevant_sources[i],
>  					    -1 /* explicitly invalid */,
>  					    NULL, 0);
> +		strintmap_init_with_options(&renames->possible_trivial_merges[i],
> +					    0, NULL, 0);
> +		strset_init_with_options(&renames->target_dirs[i],
> +					 NULL, 1);
>  		strmap_init_with_options(&renames->cached_pairs[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_irrelevant[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_target_names[i],
>  					 NULL, 0);
> +		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */

Here, there is even a separation between the grouped data. If you
don't like the idea of creating helper methods, then at least these
new lines could be grouped together. Bonus points for adding
whitespace between these lines and the others.

Thanks,
-Stolee