We are seeing latency spikes in our container environment when a container running multiple Python-based tasks exits. On exit, these tasks can hold the zone->lock for an extended period, significantly increasing allocation latency for other containers on the same machine. As a workaround, we found that shrinking the per-CPU pagelist, for example to 4 times the batch size, mitigates the spikes.

However, tuning vm.percpu_pagelist_high_fraction across a large fleet is impractical: the appropriate fraction varies with CPU count, NUMA topology, and physical memory capacity. To make this manageable, allow writing -1 to vm.percpu_pagelist_high_fraction to request the minimum pagelist size (four times the batch size).

Beyond that, given how unintuitive vm.percpu_pagelist_high_fraction is, it may be worth introducing a new parameter, vm.percpu_pagelist_high_size, that specifies the pagelist size directly as a multiple of the batch size. This would mirror the vm.dirty_ratio / vm.dirty_bytes pairing and give users more direct control.

We have also discussed introducing multiple small zones to reduce contention on the zone->lock [0], but that approach will require a longer-term implementation effort.
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@xxxxxxxxxxxxxxxxxxxx/ [0]

Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
---
 Documentation/admin-guide/sysctl/vm.rst | 4 ++++
 mm/page_alloc.c                         | 8 ++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..1f535d022cda 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
 page lists. A user can specify a number like 100 to allocate 1/100th of
 each zone between per-cpu lists.
 
+The minimum number of pages that can be stored in per-CPU page lists is
+four times the batch value. By writing '-1' to this sysctl, you can set
+this minimum value.
+
 The batch value of each per-cpu page list remains the same regardless of
 the value of the high fraction so allocation latencies are unaffected.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..e7313f9d704b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
 	int nr_split_cpus;
 	unsigned long total_pages;
 
+	/* Writing -1 requests the minimum pagelist size: four times the batch size */
+	if (high_fraction == -1)
+		return batch << 2;
+
 	if (!high_fraction) {
 		/*
 		 * By default, the high value of the pcp is based on the zone
@@ -6192,7 +6196,8 @@ static int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
 
 	/* Sanity checking to avoid pcp imbalance */
 	if (percpu_pagelist_high_fraction &&
-	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) {
+	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION &&
+	    percpu_pagelist_high_fraction != -1) {
 		percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction;
 		ret = -EINVAL;
 		goto out;
@@ -6241,7 +6246,6 @@ static struct ctl_table page_alloc_sysctl_table[] = {
 		.maxlen = sizeof(percpu_pagelist_high_fraction),
 		.mode = 0644,
 		.proc_handler = percpu_pagelist_high_fraction_sysctl_handler,
-		.extra1 = SYSCTL_ZERO,
 	},
 	{
 		.procname = "lowmem_reserve_ratio",
-- 
2.43.5