On Wed, Sep 20, 2023 at 09:41:18AM -0700, Andrew Morton wrote: > On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@xxxxxxxxx> wrote: > > > The page allocation performance requirements of different workloads > > are often different. So, we need to tune the PCP (Per-CPU Pageset) > > high on each CPU automatically to optimize the page allocation > > performance. > > Some of the performance changes here are downright scary. > > I've never been very sure that percpu pages was very beneficial (and > hey, I invented the thing back in the Mesozoic era). But these numbers > make me think it's very important and we should have been paying more > attention. > FWIW, it is because not only does it avoid lock contention issues, it avoids excessive splitting/merging of buddies as well as the slower paths of the allocator. It is not very satisfactory and frankly, the whole page allocator needs a revisit to account for very large zones but it is far from a trivial project. PCP just masks the worst of the issues and replacing it is far harder than tweaking it. > > The list of patches in series is as follows, > > > > 1 mm, pcp: avoid to drain PCP when process exit > > 2 cacheinfo: calculate per-CPU data cache size > > 3 mm, pcp: reduce lock contention for draining high-order pages > > 4 mm: restrict the pcp batch scale factor to avoid too long latency > > 5 mm, page_alloc: scale the number of pages that are batch allocated > > 6 mm: add framework for PCP high auto-tuning > > 7 mm: tune PCP high automatically > > 8 mm, pcp: decrease PCP high if free pages < high watermark > > 9 mm, pcp: avoid to reduce PCP high unnecessarily > > 10 mm, pcp: reduce detecting time of consecutive high order page freeing > > > > Patch 1/2/3 optimize the PCP draining for consecutive high-order pages > > freeing. > > > > Patch 4/5 optimize batch freeing and allocating. > > > > Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method. > > > > Patch 10 optimize the PCP draining for consecutive high order page > > freeing based on PCP high auto-tuning. > > > > The test results for patches with performance impact are as follows, > > > > kbuild > > ====== > > > > On a 2-socket Intel server with 224 logical CPU, we tested kbuild on > > one socket with `make -j 112`. > > > > build time zone lock% free_high alloc_zone > > ---------- ---------- --------- ---------- > > base 100.0 43.6 100.0 100.0 > > patch1 96.6 40.3 49.2 95.2 > > patch3 96.4 40.5 11.3 95.1 > > patch5 96.1 37.9 13.3 96.8 > > patch7 86.4 9.8 6.2 22.0 > > patch9 85.9 9.4 4.8 16.3 > > patch10 87.7 12.6 29.0 32.3 > > You're seriously saying that kbuild got 12% faster? > > I see that [07/10] (autotuning) alone sped up kbuild by 10%? > > Other thoughts: > > - What if any facilities are provided to permit users/developers to > monitor the operation of the autotuning algorithm? > Not that I've seen yet but I'm still in part of the series. It could be monitored with tracepoints but it can also be inferred from lock contention issue. I think it would only be meaningful to developers to monitor this closely, at least that's what I think now. Honestly, I'm more worried about potential changes in behaviour depending on the exact CPU and cache implementation than I am about being able to actively monitor it. > - I'm not seeing any Documentation/ updates. Surely there are things > we can tell users? > > - This: > > : It's possible that PCP high auto-tuning doesn't work well for some > : workloads. So, when PCP high is tuned by hand via the sysctl knob, > : the auto-tuning will be disabled. The PCP high set by hand will be > : used instead. > > Is it a bit hacky to disable autotuning when the user alters > pcp-high? Would it be cleaner to have a separate on/off knob for > autotuning? > It might be but tuning the allocator is very specific and once we introduce that tunable, we're probably stuck with it. I would prefer to see it introduced if and only if we have to. > And how is the user to determine that "PCP high auto-tuning doesn't work > well" for their workload? Not easily. It may manifest as variable lock contention issues when the workload is at a steady state but that would increase the pressure to split the allocator away from being zone-based entirely instead of tweaking PCP further. -- Mel Gorman SUSE Labs