Re: [PATCH v2 0/7] Final optimization batch (#15): use memory pools

On Thu, Jul 29, 2021 at 10:20 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Thu, Jul 29, 2021 at 03:58:34AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > This series is more about strmaps & memory pools than merge logic. CC'ing
> > Peff since he reviewed the strmap work[1], and that work included a number
> > of decisions that specifically had this series in mind.
>
> I haven't been following the other optimization threads very closely,
> but I'll try to give my general impressions.
>
> > === Basic Optimization idea ===
> >
> > In this series, I make use of memory pools to get faster allocations and
> > deallocations for many data structures that tend to all be deallocated at
> > the same time anyway.
> >
> > === Results ===
> >
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28), the
> > changes in just this series improve the performance as follows:
> >
> >                      Before Series           After Series
> > no-renames:      204.2  ms ±  3.0  ms    198.3 ms ±  2.9 ms
> > mega-renames:      1.076 s ±  0.015 s    661.8 ms ±  5.9 ms
> > just-one-mega:   364.1  ms ±  7.0  ms    264.6 ms ±  2.5 ms
>
> Pretty good results for the mega-renames case. I do wonder how much this
> matters in practice. That case is intentionally stressing the system,
> though I guess it's not too far-fetched (it's mostly a big directory
> rename). However, just "git checkout" across the rename already takes
> more than a second. So on the one hand, 400ms isn't nothing. On the
> other, I doubt anybody is likely to notice in the grand scheme of
> things.

The mega-renames case was spurred by looking at a repository at
$DAYJOB and trying to generate a similar testcase on a well-known open
source repository with approximately the same number of files.  The
linux kernel was handy for that.  Technically, to make it match the
right number of renames, I would have needed to rename more toplevel
directories, but it was close enough and I liked it being a simple
testcase.  So, to me, the testcase is unusual more in the number of
patches being rebased than in the number of renames, but I
chose a long sequence of patches (with lots of different types of
changes) because that served as a good correctness case and even ended
up being good for spurring searches for tweaks to existing
optimizations.  And I added the just-one-mega testcase to see how a
simple cherry-pick would fare with the large number of renames; it,
too, shows a good relative speedup.

You make fair points about the absolute timings for rebase, but the
"grand scheme of things" involves usecases outside of traditional
rebases.  Some of those involve far more merges, making the relative
timings more important.  They also involve much less overhead -- not
only do they skip the "git checkout" present in most rebases, they
also avoid the index and worktree updating included in the timings
above.  Without that extra overhead, the relative improvement
from this patch is even greater. One particular example usecase that
is coming soon is the introduction of `git log --remerge-diff`; it
makes the timing of individual merges critical since it does so many
of them.  And it becomes even more important for users who change the
default from --diff-merges=off to --diff-merges=remerge-diff, because
then any `git log -p` will potentially remerge hundreds or thousands
of merge commits.  (I've had lots of users for whom -p has implied
--remerge-diff since last November.)

> And we're paying a non-trivial cost in code complexity to do it (though
> I do think you've done an admirable job of making that cost as low as
> possible). Dropping the USE_MEMORY_POOL flag and just always using a
> pool would make a lot of that complexity go away. I understand how it
> makes leak hunting harder, but I think simpler code would probably be
> worth the tradeoff (and in a sense, there _aren't_ leaks in an
> always-pool world; we're holding on to all of the memory through the
> whole operation).

Yeah, I needed the USE_MEMORY_POOL flag while rebasing my sequence of
optimization series, to make sure the intermediate steps stayed clean
as I upstreamed all the work.  I ended up just leaving
(a simplified form of) it in when I was all done, but it has probably
already served its purpose.  I'd be fine with dropping it.
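
For readers who haven't followed the series, the switch amounts to
something like the following (a simplified sketch of the idea with a
made-up helper name, not the exact code from the patches):

    /* Sketch only; opt_alloc() is an illustrative name. */
    #define USE_MEMORY_POOL 1 /* 0 = plain malloc/free, for leak hunting */

    static void *opt_alloc(struct mem_pool *pool, size_t len)
    {
    #if USE_MEMORY_POOL
            /* pool-backed: released all at once via mem_pool_discard() */
            return mem_pool_alloc(pool, len);
    #else
            /* plain allocation: every call site needs a matching free() */
            return xmalloc(len);
    #endif
    }

Dropping the flag means the #else halves and all the matching free()
paths go away, which is where the complexity savings you mention come
from.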

> I assume your tests are just done using the regular glibc allocator. I

Yes.

> also wondered how plugging in a better allocator might fare. Here are
> timings I did of your mega-renames case with three binaries: one built
> with USE_MEMORY_POOL set to 0, one with it set to 1, and one with it set
> to 0 but adding "-ltcmalloc" to EXTLIBS via config.mak.
>
>   $ hyperfine \
>       -p 'git checkout hwmon-updates &&
>           git reset --hard fd8bdb23b91876ac1e624337bb88dc1dcc21d67e &&
>           git checkout 5.4-renames^0' \
>       -L version nopool,pool,tcmalloc \
>       './test-tool.{version} fast-rebase --onto HEAD base hwmon-updates'

Ooh, I didn't know about -L.
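
For anyone wanting to reproduce the tcmalloc build: if I'm reading
your description right, it's just a one-liner in config.mak,

    # config.mak: link tcmalloc into git in place of glibc malloc
    EXTLIBS += -ltcmalloc

plus building three separate test-tool binaries for -L to choose
between.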

>   Benchmark #1: ./test-tool.nopool fast-rebase --onto HEAD base hwmon-updates
>     Time (mean ± σ):     921.1 ms ± 146.0 ms    [User: 843.0 ms, System: 77.5 ms]
>     Range (min … max):   660.9 ms … 1112.2 ms    10 runs
>
>   Benchmark #2: ./test-tool.pool fast-rebase --onto HEAD base hwmon-updates
>     Time (mean ± σ):     635.4 ms ± 125.5 ms    [User: 563.7 ms, System: 71.3 ms]
>     Range (min … max):   496.8 ms … 856.7 ms    10 runs
>
>   Benchmark #3: ./test-tool.tcmalloc fast-rebase --onto HEAD base hwmon-updates
>     Time (mean ± σ):     727.3 ms ± 139.9 ms    [User: 654.1 ms, System: 72.9 ms]
>     Range (min … max):   476.3 ms … 900.5 ms    10 runs

That's some _really_ wide variance on your runs, making me wonder if
you are running other things on your (I presume) laptop that are
potentially muddying the numbers.  Would the tcmalloc case actually
have the fastest run in general, or was it just lucky to hit a "quiet"
moment on the laptop?

Or perhaps my pre-warming script helps reduce variance more than I thought...
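
Something like hyperfine's --warmup option might also help -- untimed
warmup runs address cold caches, though not other processes competing
for the machine.  E.g., reusing your invocation:

    $ hyperfine --warmup 3 \
        -p 'git checkout hwmon-updates &&
            git reset --hard fd8bdb23b91876ac1e624337bb88dc1dcc21d67e &&
            git checkout 5.4-renames^0' \
        -L version nopool,pool,tcmalloc \
        './test-tool.{version} fast-rebase --onto HEAD base hwmon-updates'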

>   Summary
>     './test-tool.pool fast-rebase --onto HEAD base hwmon-updates' ran
>       1.14 ± 0.32 times faster than './test-tool.tcmalloc fast-rebase --onto HEAD base hwmon-updates'
>       1.45 ± 0.37 times faster than './test-tool.nopool fast-rebase --onto HEAD base hwmon-updates'
>
> The pool allocator does come out ahead when comparing means, but the
> improvement is within the noise (and the fastest run was actually with
> tcmalloc).
>
> I was also curious about peak heap usage. According to massif, the pool
> version peaks at ~800k extra (out of 82MB), which is negligible. Plus it
> has fewer overall allocations, so it seems to actually save 4-5MB in
> malloc overhead (though I would imagine that varies between allocators;
> I'm just going from massif numbers here).

I did similar testing a year ago, before I even looked at memory
pools.  I was surprised by how big a speedup I saw, and considered
asking on the list if we could push to use a different allocator by
default.  Ultimately, I figured that probably wouldn't fly and
distributors might override our choices anyway.  It was at that point
that I decided to start tweaking mem-pool.[ch] (which ended up getting
merged at edab8a8d07 ("Merge branch 'en/mem-pool'", 2020-08-27)), and
then integrating that into strmap/strset/strintmap -- all in an effort
to guarantee we realized the speedups that my testing with the special
allocators had shown were possible.
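
The net effect for callers is roughly this pattern (a minimal sketch
using the real mem-pool/strmap function names, but with illustrative
data):

    #include "mem-pool.h"
    #include "strmap.h"

    static void pool_backed_map_example(void)
    {
            struct mem_pool pool;
            struct strmap paths;

            mem_pool_init(&pool, 0);
            /* keys are strdup'ed into the pool; entries come from it too */
            strmap_init_with_options(&paths, &pool, 1 /* strdup_strings */);

            strmap_put(&paths, "drivers/hwmon/foo.c", (void *)"some value");
            /* ... many thousands of insertions during a merge ... */

            strmap_clear(&paths, 0);    /* frees just the hash table */
            mem_pool_discard(&pool, 0); /* frees entries and strings en masse */
    }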

> So...I dunno. It's hard to assess the real-world impact of the speedup,
> compared to the complexity cost. Ultimately, this is changing code that
> you wrote and will probably maintain. So I'd leave the final decision
> for that tradeoff to you. I'm just injecting some thoughts and numbers. :)

It's nice to see others reproduce the results, and I appreciate the
sanity check on overall usefulness.  If it were just optimizing
rebase, I probably could have quit long ago when I had 15s rebases of
35 patches.  It was partly stubbornness on my part, wondering why it
couldn't be sub-second, and partly knowing of other usecases that I
wanted to make attractive to others.  I want --remerge-diff to be
useful *and* practical.  I want rebasing/cherry-picking un-checked-out
branches to be useful and practical.  And I'd like to entice server
operators (GitHub, GitLab, etc.) to use the same merge machinery used
by end users so that reported merge-ability matches what end users see
when they try it themselves.

You make a good suggestion to drop the USE_MEMORY_POOL switch.  I
think I'll do it as an additional patch at the end of the series, just
so it's easy for me to restore if by chance it's ever needed.



