Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. These
processes are organized as separate processes rather than threads
because the Python Global Interpreter Lock (GIL) becomes a bottleneck
in a multi-threaded setup. When these containers exit, other containers
hosted on the same machine experience significant latency spikes.

Investigation
=============

During my investigation of this issue, I found that the latency spikes
were caused by contention on zone->lock, which can be illustrated as
follows:

        CPU A (Freer)                 CPU B (Allocator)
    lock zone->lock
    free pages
                                  lock zone->lock
    unlock zone->lock
                                  alloc pages
                                  unlock zone->lock

If the Freer holds zone->lock for an extended period, the Allocator has
to wait, and latency spikes occur. I also wrote a Python script to
reproduce the issue on my test servers; see the details in patch #3. It
is worth noting that the reproducer is based on the upstream kernel.

Experimenting
=============

The more pages are freed in one batch, the longer zone->lock is held,
so my attempt involved reducing the batch size. After I restricted the
batch to the smallest size, there were no more complaints about latency
spikes. However, during my experiments I found that
CONFIG_PCP_BATCH_SCALE_MAX is hard to use in practice, so this series
tries to improve it.

The Proposal
============

This series comprises two minor refinements to the PCP high watermark
auto-tuning mechanism, along with the introduction of a new sysctl knob
that serves as a more practical alternative to the previous
configuration method.

Future work
===========

To ultimately mitigate the zone->lock contention, several approaches
have been suggested. One is to divide large zones into multiple smaller
zones, as suggested by Matthew [0]; another is to split zone->lock
using a mechanism similar to memory arenas and to shift away from
relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel [1]. However,
implementing these solutions is likely to require a more extended
development effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@xxxxxxxxxxxxxxxxxxxx/ [0]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@xxxxxxxxxxxxxxxxxxx/ [1]

Changes:
- v2->v3:
  - Commit log refinement
  - Rebase it on mm-everything
- v1->v2: https://lwn.net/Articles/983837/
  - Commit log refinement
- v1: mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  https://lwn.net/Articles/981069/
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@xxxxxxxxx/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 40 ++++++++++++++++++-------
 3 files changed, 47 insertions(+), 21 deletions(-)

--
2.34.1
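
P.S. For anyone who wants to observe the problem before digging into
patch #3, below is a minimal, hypothetical sketch of such a reproducer.
It is not the script shipped in patch #3; the process count and
per-process RSS are assumptions mirroring the numbers above, and the
host needs more than NUM_WORKERS * WORKER_RSS_GB GB of free memory.

import multiprocessing
import time

NUM_WORKERS = 18        # mirrors the 18 processes per container
WORKER_RSS_GB = 6       # ~6GB of RSS each, as described above
PAGE_SIZE = 4096

def worker(barrier):
    # Fault in every page so the memory actually becomes resident RSS.
    buf = bytearray(WORKER_RSS_GB << 30)
    for off in range(0, len(buf), PAGE_SIZE):
        buf[off] = 1
    # Wait until all workers are resident, then exit together so the
    # kernel frees ~NUM_WORKERS * WORKER_RSS_GB GB nearly at once.
    barrier.wait()

def monitor(stop):
    # Repeatedly allocate and fault in a small buffer; a latency spike
    # here indicates the allocator is stuck waiting on zone->lock.
    while not stop.is_set():
        start = time.monotonic()
        blk = bytearray(64 << 20)     # 64MB
        for off in range(0, len(blk), PAGE_SIZE):
            blk[off] = 1
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"alloc+fault 64MB took {elapsed_ms:.1f} ms")
        del blk
        time.sleep(0.1)

if __name__ == "__main__":
    barrier = multiprocessing.Barrier(NUM_WORKERS)
    stop = multiprocessing.Event()
    mon = multiprocessing.Process(target=monitor, args=(stop,))
    mon.start()
    workers = [multiprocessing.Process(target=worker, args=(barrier,))
               for _ in range(NUM_WORKERS)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    time.sleep(5)       # keep sampling latency during/after the bulk free
    stop.set()
    mon.join()

With this series applied, the effect of capping the batch can then be
checked against the same workload via, e.g.,
"sysctl -w vm.pcp_batch_scale_max=0" (assuming the knob accepts the
same range as the CONFIG_PCP_BATCH_SCALE_MAX option it replaces).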