On Thu, 2022-03-03 at 14:27 +0100, Vlastimil Babka wrote:
> On 2/8/22 11:07, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu page lists drain mechanism
> > with one that allows accessing the lists remotely. Currently, only the
> > local CPU is permitted to change its per-cpu lists, and it is expected to
> > do so, on demand, whenever a process requests it by queueing a drain task
> > on that CPU. This causes problems for NOHZ_FULL CPUs and real-time systems
> > that can't take any sort of interruption, and to a lesser extent
> > inconveniences idle and virtualised systems.
> >
> > The new algorithm atomically switches the pointer to the per-cpu page
> > lists and uses RCU to make sure the old lists are no longer being
> > concurrently used before draining them. Its main benefit is that it fixes
> > the issue for good, avoiding the need for configuration-based heuristics
> > or having to modify applications (i.e. using the isolation prctl being
> > worked on by Marcelo Tosatti ATM).
> >
> > All this with minimal performance implications: a page allocation
> > microbenchmark was run on multiple systems and architectures, generally
> > showing no performance differences; only the more extreme cases showed a
> > 1-3% degradation. See data below. Needless to say, I'd appreciate it if
> > someone could validate my numbers independently.
> >
> > The approach has been stress-tested: I forced 100 drains/s while running
> > mmtests' pft in a loop for a full day on multiple machines and archs
> > (arm64, x86_64, ppc64le).
> >
> > Note that this is not the first attempt at fixing the per-cpu page lists
> > issue:
> > - The first attempt[1] tried to conditionally change the pagesets locking
> >   scheme based on the NOHZ_FULL config. It was deemed hard to maintain as
> >   the NOHZ_FULL code path would be rarely tested. Also, this only solves
> >   the issue for NOHZ_FULL setups, which isn't ideal.
> > - The second[2] unconditionally switched the local_locks to per-cpu
> >   spinlocks. The performance degradation was too big.
>
> For completeness, what was the fate of the approach to have pcp->high = 0
> for NOHZ cpus? [1] It would be nice to have documented why it wasn't
> feasible. Too much overhead for when these CPUs eventually do allocate, or
> some other unforeseen issue? Thanks.

Yes, sorry, I should have been more explicit about why I haven't gone that
way yet. Some points:

- As I mention above, not only CPU isolation users care about this; RT and
  HPC do too. This is my main motivation for focusing on this solution, or
  potentially Mel's.

- Fully disabling pcplists on nohz_full CPUs is too drastic, as isolated
  CPUs might want to retain the performance edge while not running their
  sensitive workloads. (I remember Christoph Lameter commenting about this
  on the previous RFC.)

- So the idea would be to selectively disable pcplists upon entering the
  really 'isolated' area. This could be achieved with Marcelo Tosatti's new
  WIP prctl[1]. And if we decide the current solutions are unacceptable,
  I'll have a go at it.

Thanks!

[1] https://lore.kernel.org/lkml/20220204173554.534186379@fedora.localdomain/T/

--
Nicolás Sáenz