Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. These
processes are organized as separate processes rather than threads
because the Python Global Interpreter Lock (GIL) becomes a bottleneck
in a multi-threaded setup. When these containers exit, other containers
hosted on the same machine experience significant latency spikes.

Investigation
=============

During my investigation of this issue, I found that the latency spikes
were caused by contention on zone->lock, which can be illustrated as
follows:

        CPU A (Freer)                 CPU B (Allocator)
    lock zone->lock
    free pages
                                  lock zone->lock
    unlock zone->lock
                                  alloc pages
                                  unlock zone->lock

If the Freer holds zone->lock for an extended period, the Allocator has
to wait, and latency spikes occur. I also wrote a Python script to
reproduce the issue on my test servers; see the details in patch #3. It
is worth noting that the reproducer is based on the upstream kernel.

Experimenting
=============

The more pages are freed in one batch, the longer zone->lock is held,
so my attempt involved reducing the batch size. After I restricted the
batch to the smallest size, there were no more complaints about latency
spikes. However, during my experiments I found that
CONFIG_PCP_BATCH_SCALE_MAX is hard to use in practice, so this series
tries to improve it.

The Proposal
============

This series comprises two minor refinements to the PCP high watermark
auto-tuning mechanism, along with the introduction of a new sysctl knob
that serves as a more practical alternative to the previous
configuration method.

Future work
===========

To ultimately mitigate the zone->lock contention, several approaches
have been suggested. One is to divide large zones into multiple smaller
zones, as suggested by Matthew [0]; another is to split zone->lock
using a mechanism similar to memory arenas and to shift away from
relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel [1]. However,
implementing these solutions is likely to require a more extended
development effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@xxxxxxxxxxxxxxxxxxxx/ [0]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@xxxxxxxxxxxxxxxxxxx/ [1]

Changes:
- v2->v3:
  - Commit log refinement
  - Rebase it on mm-everything
- v1->v2: https://lwn.net/Articles/983837/
  - Commit log refinement
- v1: mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  https://lwn.net/Articles/981069/
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@xxxxxxxxx/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 40 ++++++++++++++++++-------
 3 files changed, 47 insertions(+), 21 deletions(-)

--
2.34.1
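
P.S. For anyone who wants to observe the problem before digging into
patch #3, below is a minimal, hypothetical sketch of such a reproducer.
It is not the script shipped in patch #3; the process count and
per-process RSS are assumptions mirroring the numbers above, and the
host needs more than NUM_WORKERS * WORKER_RSS_GB GB of free memory.

import multiprocessing
import time

NUM_WORKERS = 18        # mirrors the 18 processes per container
WORKER_RSS_GB = 6       # ~6GB of RSS each, as described above
PAGE_SIZE = 4096

def worker(barrier):
    # Fault in every page so the memory actually becomes resident RSS.
    buf = bytearray(WORKER_RSS_GB << 30)
    for off in range(0, len(buf), PAGE_SIZE):
        buf[off] = 1
    # Wait until all workers are resident, then exit together so the
    # kernel frees ~NUM_WORKERS * WORKER_RSS_GB GB nearly at once.
    barrier.wait()

def monitor(stop):
    # Repeatedly allocate and fault in a small buffer; a latency spike
    # here indicates the allocator is stuck waiting on zone->lock.
    while not stop.is_set():
        start = time.monotonic()
        blk = bytearray(64 << 20)     # 64MB
        for off in range(0, len(blk), PAGE_SIZE):
            blk[off] = 1
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"alloc+fault 64MB took {elapsed_ms:.1f} ms")
        del blk
        time.sleep(0.1)

if __name__ == "__main__":
    barrier = multiprocessing.Barrier(NUM_WORKERS)
    stop = multiprocessing.Event()
    mon = multiprocessing.Process(target=monitor, args=(stop,))
    mon.start()
    workers = [multiprocessing.Process(target=worker, args=(barrier,))
               for _ in range(NUM_WORKERS)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    time.sleep(5)       # keep sampling latency during/after the bulk free
    stop.set()
    mon.join()

With this series applied, the effect of capping the batch can then be
checked against the same workload via, e.g.,
"sysctl -w vm.pcp_batch_scale_max=0" (assuming the knob accepts the
same range as the CONFIG_PCP_BATCH_SCALE_MAX option it replaces).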