Re: [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to

Elijah Newren <newren@xxxxxxxxx> · Thu, 15 Jul 2021 09:28:43 -0700

On Thu, Jul 15, 2021 at 7:55 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@xxxxxxxxx>
> ...
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
>
> Wow! This is quite the savings, and reinforces that when no renames
> exist we are likely to hit this optimization.

Just to clarify, "no-renames" was an unfortunate misnomer.  I should
have called it "few-renames".

But yeah, I was pretty amazed a year ago (when I initially implemented
this optimization) seeing the speedups on this testcase -- a testcase
that I hadn't even been paying much attention to beyond trying to
avoid regressing it.

> >     mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
>
> I'm surprised that this one works so well, too. Clearly, there must
> be a lot of directories that can be skipped despite many renames
> existing.

One small reason it works is that some of the commits don't touch any
paths involved in the big directory rename.  That means that for those
commits, all the renames under that big directory are irrelevant.
This fact makes the optimization work for 4 of the 35 patches.  (Of
course, I did pick that particular series of commits for rebasing
because it touched the directory being renamed a lot, including adding
files to it, which made it a better stress test case for my
optimizations.  But that does make it less likely for the
irrelevant-renames criteria to help us out on this particular
optimization.)

The bigger reason this optimization works so well on this testcase, is
the mere fact that this rebase involved a sequence of 35 commits.  The
optimization doesn't work for the first commit (as noted by the
just-one-mega testcase below).  However, once the first commit is
picked, renames on the upstream side that were relevant in the first
commit are cached, so the optimization does trigger for many of the
subsequent commits.  And whenever the optimization doesn't work on a
subsequent commit -- which happens because the new commit modifies
different files that make other renames become relevant -- then after
detecting the new renames those also get cached.  More cached renames
means a higher likelihood that we can skip recursing into directories
when rebasing subsequent commits.

> >     just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms
>
> And no overhead added to this case. Good.
>
> > +             /* Loop over the set of paths we need to know rename info for */
> > +             strset_for_each_entry(&renames->relevant_sources[side],
> > +                                   &iter, entry) {
> > +                     char *rename_target, *dir, *dir_marker;
> > +                     struct strmap_entry *e;
> > +
> > +                     /*
> > +                      * if we don't know delete/rename info for this path,
> > +                      * then we need to recurse into all trees to get all
> > +                      * adds to make sure we have it.
> > +                      */
>
> super-nit: s/if/If/

Will fix.

> Otherwise, I can't speak to the code being 100% correct because
> this area is so dense. I find the code to be well organized,
> which will help finding and fixing any potential bugs that might
> show up in strange corner cases later.
>
> Thanks,
> -Stolee