Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Tue, 14 Dec 2021 07:58:23 -0300

On Fri, Dec 10, 2021 at 10:55:49AM +0000, Mel Gorman wrote:
> On Thu, Dec 09, 2021 at 02:45:35PM -0300, Marcelo Tosatti wrote:
> > On Fri, Dec 03, 2021 at 02:13:06PM +0000, Mel Gorman wrote:
> > > On Wed, Nov 03, 2021 at 06:05:12PM +0100, Nicolas Saenz Julienne wrote:
> > > > Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> > > > drain work queued by __drain_all_pages(). So introduce new a mechanism
> > > > to remotely drain the per-cpu lists. It is made possible by remotely
> > > > locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this
> > > > new scheme is that drain operations are now migration safe.
> > > > 
> > > > There was no observed performance degradation vs. the previous scheme.
> > > > Both netperf and hackbench were run in parallel to triggering the
> > > > __drain_all_pages(NULL, true) code path around ~100 times per second.
> > > > The new scheme performs a bit better (~5%), although the important point
> > > > here is there are no performance regressions vs. the previous mechanism.
> > > > Per-cpu lists draining happens only in slow paths.
> > > > 
> > > 
> > > netperf and hackbench are not great indicators of page allocator
> > > performance as IIRC they are more slab-intensive than page allocator
> > > intensive. I ran the series through a few benchmarks and can confirm
> > > that there was negligible difference to netperf and hackbench.
> > > 
> > > However, on Page Fault Test (pft in mmtests), it is noticable. On a
> > > 2-socket cascadelake machine I get
> > > 
> > > pft timings
> > >                                  5.16.0-rc1             5.16.0-rc1
> > >                                     vanilla    mm-remotedrain-v2r1
> > > Amean     system-1         27.48 (   0.00%)       27.85 *  -1.35%*
> > > Amean     system-4         28.65 (   0.00%)       30.84 *  -7.65%*
> > > Amean     system-7         28.70 (   0.00%)       32.43 * -13.00%*
> > > Amean     system-12        30.33 (   0.00%)       34.21 * -12.80%*
> > > Amean     system-21        37.14 (   0.00%)       41.51 * -11.76%*
> > > Amean     system-30        36.79 (   0.00%)       46.15 * -25.43%*
> > > Amean     system-48        58.95 (   0.00%)       65.28 * -10.73%*
> > > Amean     system-79       111.61 (   0.00%)      114.78 *  -2.84%*
> > > Amean     system-80       113.59 (   0.00%)      116.73 *  -2.77%*
> > > Amean     elapsed-1        32.83 (   0.00%)       33.12 *  -0.88%*
> > > Amean     elapsed-4         8.60 (   0.00%)        9.17 *  -6.66%*
> > > Amean     elapsed-7         4.97 (   0.00%)        5.53 * -11.30%*
> > > Amean     elapsed-12        3.08 (   0.00%)        3.43 * -11.41%*
> > > Amean     elapsed-21        2.19 (   0.00%)        2.41 * -10.06%*
> > > Amean     elapsed-30        1.73 (   0.00%)        2.04 * -17.87%*
> > > Amean     elapsed-48        1.73 (   0.00%)        2.03 * -17.77%*
> > > Amean     elapsed-79        1.61 (   0.00%)        1.64 *  -1.90%*
> > > Amean     elapsed-80        1.60 (   0.00%)        1.64 *  -2.50%*
> > > 
> > > It's not specific to cascade lake, I see varying size regressions on
> > > different Intel and AMD chips, some better and worse than this result.
> > > The smallest regression was on a single CPU skylake machine with a 2-6%
> > > hit. Worst was Zen1 with a 3-107% hit.
> > > 
> > > I didn't profile it to establish why but in all cases the system CPU
> > > usage was much higher. It *might* be because the spinlock in
> > > per_cpu_pages crosses a new cache line and it might be cold although the
> > > penalty seems a bit high for that to be the only factor.
> > > 
> > > Code-wise, the patches look fine but the apparent penalty for PFT is
> > > too severe.
> > 
> > Mel,
> > 
> > Have you read Nicolas RCU patches?
> > 
> 
> I agree with Vlastimil's review on overhead.

Not sure those points are any fundamental performance problem with RCU: 
https://paulmck.livejournal.com/31058.html

> I think it would be more straight-forward to disable the pcp allocator for
> NOHZ_FULL CPUs like what zone_pcp_disable except for individual CPUs with
> care taken to not accidentally re-enable nohz CPus in zone_pcp_enable. The
> downside is that there will be a performance penalty if an application
> running on a NOHZ_FULL CPU is page allocator intensive for whatever
> reason.  However, I guess this is unlikely because if there was a lot
> of kernel activity for a NOHZ_FULL CPU, the vmstat shepherd would also
> cause interference.

Yes, it does, and its being fixed:

https://lkml.org/lkml/2021/12/8/663

Honestly i am not sure whether the association between a nohz_full CPU
and "should be mostly in userspace" is desired. The RCU solution
would be more generic. As Nicolas mentioned, for the usecases in
questions, either solution is OK.

Thomas, Frederic, Christoph, do you have any opinion on this ?