Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves

Elijah Newren <newren@xxxxxxxxx> · Thu, 15 Jul 2021 08:54:36 -0700

On Thu, Jul 15, 2021 at 6:54 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@xxxxxxxxx>
>
> The first two patches were easy to follow. This one requires a more
> careful read, so here is my attempt to double-check that I am
> reading it right.

Yep, you seem to be understanding all of it just fine (does that mean
I did a good job explaining?), and made some good suggestions to
improve it further.

> > In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> > possible", 2021-03-11), we added an additional reason that rename
> > detection could be skipped entirely -- namely, if no *relevant* sources
> > were present.  Without completing collect_merge_info_callback(), we do
> > not yet know if there are no relevant sources.  However, we do know that
> > if the current directory on one side matches the merge base, then every
> > source file within that directory will not be RELEVANT_CONTENT, and a
> > few simple checks can often let us rule out RELEVANT_LOCATION as well.
> > This suggests we can just defer recursing into such directories until
> > the end of collect_merge_info.
>
> We are not making early decisions, but collecting a list of directories
> that cannot be relevant sources. These directories could still be
> relevant locations (or content), but we don't want to pay the cost of
> recursing right away.
>
> > Since the deferred directories are known to not add any relevant sources
> > due to the above properties, then if there are no relevant sources after
> > we've traversed all paths other than the deferred ones, then we know
> > there are not any relevant sources.  Under those conditions, rename
> > detection is unnecessary, and that means we can resolve the deferred
> > directories without recursing into them.
>
> Only in the case where there are no relevant sources at the end of
> scanning all directories (possibly non-recursively), then we can skip
> recursing into these directories because finding relevant locations
> is wasted effort now.
>
> This means the performance benefit will be limited to cases where
> rename detection isn't needed at all, but that is a frequent enough
> case to merit the overhead of tracking this information differently.
>
> > Note that the logic for skipping rename detection was also modified
> > further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> > renames when possible", 2021-01-30); in particular rename detection can
> > be skipped if we already have cached renames for each relevant source.
> > We can take advantage of this information as well with our deferral of
> > recursing into directories where one side matches the merge base.
> >
> > Add some data structures that we will use to do these deferrals, with
> > some lengthy comments explaining their purpose.
>
> And now that the idea is understood, we are just laying the groundwork
> for the process by creating the tracking data structures, not actually
> the algorithm to use them. Makes sense to split them up.
>
> > +     /*
> > +      * possible_trivial_merges: directories we defer recursing into
>
> Not only are we deferring the recursion, we might be able to avoid
> it entirely. Perhaps "directories to be explored only when needed"
> would emphasize this point differently?

I like the sound of that; seems clearer.  Will update.

> > +      *
> > +      * possible_trivial_merges is a map of directory names to
> > +      * dir_rename_mask.  When we detect that a directory is unchanged on
> > +      * one side, we can sometimes resolve the directory without recursing
> > +      * into it.  Renames are the only things that can prevent such an
> > +      * optimization.  However, for rename sources:
> > +      *   - If no parent directory needed directory rename detection, then
> > +      *     no path under such a directory can be a relevant_source.
> > +      * and for rename destinations:
> > +      *   - If no cached rename has a target path under the directory AND
> > +      *   - If there are no unpaired relevant_sources elsewhere in the
> > +      *     repository
> > +      * then we don't need any path under this directory for a rename
> > +      * destination.  The only way to know the last item above is to defer
> > +      * handling such directories until the end of collect_merge_info(),
> > +      * in handle_deferred_entries().
> > +      *
> > +      * For each we store dir_rename_mask, since that's the only bit of
> > +      * information we need, other than the path, to resume the recursive
> > +      * traversal.
> > +      */
> > +     struct strintmap possible_trivial_merges[3];
> > +
> > +     /*
> > +      * trivial_merges_okay: if trivial directory merges are okay
> > +      *
> > +      * See possible_trivial_merges above.  The "no unpaired
> > +      * relevant_sources elsewhere in the repository" is a single boolean
> > +      * per merge side, which we store here.  Note that while 0 means no,
> > +      * 1 only means "maybe" rather than "yes"; we optimistically set it
> > +      * to 1 initially and only clear when we determine it is unsafe to
> > +      * do trivial directory merges.
> > +      */
> > +     unsigned trivial_merges_okay[3];
> > +
> > +     /*
> > +      * target_dirs: ancestor directories of rename targets
> > +      *
> > +      * target_dirs contains all directory names that are an ancestor of
> > +      * any rename destination.
> > +      */
> > +     struct strset target_dirs[3];
>
> These three members seem tightly coupled. Would it be beneficial to
> group the data into a 'struct trivial_merge_data' that could then
> have an array of three here?

Ooh, I like that idea; makes sense.  That'll naturally flow with your
last two comments too.

> >       /*
> >        * dir_rename_mask:
> >        *   0: optimization removing unmodified potential rename source okay
> > @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
> >               strintmap_func(&renames->dirs_removed[i]);
> >               strmap_func(&renames->dir_renames[i], 0);
> >               strintmap_func(&renames->relevant_sources[i]);
> > +             strintmap_func(&renames->possible_trivial_merges[i]);
> > +             strset_func(&renames->target_dirs[i]);
> > +             renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> This could be collected into an initializer method for the data
> required by this optimization instead of three initializations
> among many.
>
> >               if (!reinitialize)
> >                       assert(renames->cached_pairs_valid_side == 0);
> >               if (i != renames->cached_pairs_valid_side) {
> > @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
> >               strintmap_init_with_options(&renames->relevant_sources[i],
> >                                           -1 /* explicitly invalid */,
> >                                           NULL, 0);
> > +             strintmap_init_with_options(&renames->possible_trivial_merges[i],
> > +                                         0, NULL, 0);
> > +             strset_init_with_options(&renames->target_dirs[i],
> > +                                      NULL, 1);
> >               strmap_init_with_options(&renames->cached_pairs[i],
> >                                        NULL, 1);
> >               strset_init_with_options(&renames->cached_irrelevant[i],
> >                                        NULL, 1);
> >               strset_init_with_options(&renames->cached_target_names[i],
> >                                        NULL, 0);
> > +             renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> Here, there is even a separation between the grouped data. If you
> don't like the idea of creating helper methods, then at least these
> new lines could be grouped together. Bonus points for adding
> whitespace between these lines and the others.