On Thu, Dec 09, 2021 at 02:45:35PM -0300, Marcelo Tosatti wrote: > On Fri, Dec 03, 2021 at 02:13:06PM +0000, Mel Gorman wrote: > > On Wed, Nov 03, 2021 at 06:05:12PM +0100, Nicolas Saenz Julienne wrote: > > > Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu > > > drain work queued by __drain_all_pages(). So introduce new a mechanism > > > to remotely drain the per-cpu lists. It is made possible by remotely > > > locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this > > > new scheme is that drain operations are now migration safe. > > > > > > There was no observed performance degradation vs. the previous scheme. > > > Both netperf and hackbench were run in parallel to triggering the > > > __drain_all_pages(NULL, true) code path around ~100 times per second. > > > The new scheme performs a bit better (~5%), although the important point > > > here is there are no performance regressions vs. the previous mechanism. > > > Per-cpu lists draining happens only in slow paths. > > > > > > > netperf and hackbench are not great indicators of page allocator > > performance as IIRC they are more slab-intensive than page allocator > > intensive. I ran the series through a few benchmarks and can confirm > > that there was negligible difference to netperf and hackbench. > > > > However, on Page Fault Test (pft in mmtests), it is noticable. On a > > 2-socket cascadelake machine I get > > > > pft timings > > 5.16.0-rc1 5.16.0-rc1 > > vanilla mm-remotedrain-v2r1 > > Amean system-1 27.48 ( 0.00%) 27.85 * -1.35%* > > Amean system-4 28.65 ( 0.00%) 30.84 * -7.65%* > > Amean system-7 28.70 ( 0.00%) 32.43 * -13.00%* > > Amean system-12 30.33 ( 0.00%) 34.21 * -12.80%* > > Amean system-21 37.14 ( 0.00%) 41.51 * -11.76%* > > Amean system-30 36.79 ( 0.00%) 46.15 * -25.43%* > > Amean system-48 58.95 ( 0.00%) 65.28 * -10.73%* > > Amean system-79 111.61 ( 0.00%) 114.78 * -2.84%* > > Amean system-80 113.59 ( 0.00%) 116.73 * -2.77%* > > Amean elapsed-1 32.83 ( 0.00%) 33.12 * -0.88%* > > Amean elapsed-4 8.60 ( 0.00%) 9.17 * -6.66%* > > Amean elapsed-7 4.97 ( 0.00%) 5.53 * -11.30%* > > Amean elapsed-12 3.08 ( 0.00%) 3.43 * -11.41%* > > Amean elapsed-21 2.19 ( 0.00%) 2.41 * -10.06%* > > Amean elapsed-30 1.73 ( 0.00%) 2.04 * -17.87%* > > Amean elapsed-48 1.73 ( 0.00%) 2.03 * -17.77%* > > Amean elapsed-79 1.61 ( 0.00%) 1.64 * -1.90%* > > Amean elapsed-80 1.60 ( 0.00%) 1.64 * -2.50%* > > > > It's not specific to cascade lake, I see varying size regressions on > > different Intel and AMD chips, some better and worse than this result. > > The smallest regression was on a single CPU skylake machine with a 2-6% > > hit. Worst was Zen1 with a 3-107% hit. > > > > I didn't profile it to establish why but in all cases the system CPU > > usage was much higher. It *might* be because the spinlock in > > per_cpu_pages crosses a new cache line and it might be cold although the > > penalty seems a bit high for that to be the only factor. > > > > Code-wise, the patches look fine but the apparent penalty for PFT is > > too severe. > > Mel, > > Have you read Nicolas RCU patches? > I agree with Vlastimil's review on overhead. I think it would be more straight-forward to disable the pcp allocator for NOHZ_FULL CPUs like what zone_pcp_disable except for individual CPUs with care taken to not accidentally re-enable nohz CPus in zone_pcp_enable. The downside is that there will be a performance penalty if an application running on a NOHZ_FULL CPU is page allocator intensive for whatever reason. However, I guess this is unlikely because if there was a lot of kernel activity for a NOHZ_FULL CPU, the vmstat shepherd would also cause interference. -- Mel Gorman SUSE Labs