Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

"Huang, Ying" <ying.huang@xxxxxxxxx> · Wed, 10 Jul 2024 11:00:16 +0800

Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> Background
> ==========
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> processes are organized as separate processes rather than threads due
> to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> multi-threaded setup. Upon the exit of these containers, other
> containers hosted on the same machine experience significant latency
> spikes.
>
> Investigation
> =============
>
> My investigation using perf tracing revealed that the root cause of
> these spikes is the simultaneous execution of exit_mmap() by each of
> the exiting processes. This concurrent access to the zone->lock
> results in contention, which becomes a hotspot and negatively impacts
> performance. The perf results clearly indicate this contention as a
> primary contributor to the observed latency issues.
>
> +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>    - 76.97% exit_mmap
>       - 58.58% unmap_vmas
>          - 58.55% unmap_single_vma
>             - unmap_page_range
>                - 58.32% zap_pte_range
>                   - 42.88% tlb_flush_mmu
>                      - 42.76% free_pages_and_swap_cache
>                         - 41.22% release_pages
>                            - 33.29% free_unref_page_list
>                               - 32.37% free_unref_page_commit
>                                  - 31.64% free_pcppages_bulk
>                                     + 28.65% _raw_spin_lock
>                                       1.28% __list_del_entry_valid
>                            + 3.25% folio_lruvec_lock_irqsave
>                            + 0.75% __mem_cgroup_uncharge_list
>                              0.60% __mod_lruvec_state
>                           1.07% free_swap_cache
>                   + 11.69% page_remove_rmap
>                     0.64% __mod_lruvec_page_state
>       - 17.34% remove_vma
>          - 17.25% vm_area_free
>             - 17.23% kmem_cache_free
>                - 17.15% __slab_free
>                   - 14.56% discard_slab
>                        free_slab
>                        __free_slab
>                        __free_pages
>                      - free_unref_page
>                         - 13.50% free_unref_page_commit
>                            - free_pcppages_bulk
>                               + 13.44% _raw_spin_lock

I don't think your change will reduce zone->lock contention cycles.  So,
I don't see the value of the above data.

> By enabling the mm_page_pcpu_drain tracepoint, we can identify the pages
> being drained, the majority of which are regular order-0 user pages.
>
>           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>  => free_pcppages_bulk
>  => free_unref_page_commit
>  => free_unref_page_list
>  => release_pages
>  => free_pages_and_swap_cache
>  => tlb_flush_mmu
>  => zap_pte_range
>  => unmap_page_range
>  => unmap_single_vma
>  => unmap_vmas
>  => exit_mmap
>  => mmput
>  => do_exit
>  => do_group_exit
>  => get_signal
>  => arch_do_signal_or_restart
>  => exit_to_user_mode_prepare
>  => syscall_exit_to_user_mode
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
>
> The servers experiencing these issues are equipped with impressive
> hardware specifications, including 256 CPUs and 1TB of memory, all
> within a single NUMA node. The zoneinfo is as follows,
>
> Node 0, zone   Normal
>   pages free     144465775
>         boost    0
>         min      1309270
>         low      1636587
>         high     1963904
>         spanned  564133888
>         present  296747008
>         managed  291974346
>         cma      0
>         protection: (0, 0, 0, 0)
> ...
>   pagesets
>     cpu: 0
>               count: 2217
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 1
>               count: 4510
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 2
>               count: 3059
>               high:  6392
>               batch: 63
>
> ...
>
> The pcp high is around 100 times the batch size.
>
> I also traced the latency associated with the free_pcppages_bulk()
> function during the container exit process:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |*************************************** |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> The latency can reach tens of milliseconds.
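
For comparing histograms like the one above, a small helper can report
what share of the samples falls at or above a given latency.  This is only
a rough sketch and not part of the series; the names parse_hist() and
share_at_or_above() are made up for illustration, and the sample rows are
copied from the histogram above.

```python
import re

# One bucket row of a log2 histogram, optionally quoted with ">", e.g.
# ">   1048576 -> 2097151    : 284      |*****     |"
ROW = re.compile(r"^\s*>?\s*(\d+)\s*->\s*(\d+)\s*:\s*(\d+)\s*\|")


def parse_hist(text):
    """Return a list of (low_ns, high_ns, count) tuples, one per bucket row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append(tuple(int(g) for g in m.groups()))
    return rows


def share_at_or_above(rows, threshold_ns):
    """Fraction of samples in buckets whose lower bound is >= threshold_ns."""
    total = sum(count for _, _, count in rows)
    slow = sum(count for low, _, count in rows if low >= threshold_ns)
    return slow / total if total else 0.0


# Three rows taken from the histogram above (hypothetical subset for the demo).
SAMPLE = """\
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |*************************************** |
"""

rows = parse_hist(SAMPLE)
print(rows)
print(share_at_or_above(rows, 1 << 20))  # share of samples at >= ~1ms
```

Feeding each full histogram through the same helper gives one number per
configuration, which is easier to compare than the ASCII bars.
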
>
> Experimenting
> =============
>
> vm.percpu_pagelist_high_fraction
> --------------------------------
>
> The kernel version currently deployed in our production environment is the
> stable 6.1.y, and my initial strategy involves optimizing the

IMHO, we should focus on upstream activity in the cover letter and patch
description.  And I don't think that it's necessary to describe the
alternative solution in too much detail.

> vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> page draining, which subsequently leads to a substantial reduction in
> latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> improvement in latency.
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> pcp high watermark size to a minimum of four times the batch size. While
> this could theoretically affect throughput, as highlighted by Ying[0], we
> have yet to observe any significant difference in throughput within our
> production environment after implementing this change.
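
To make the "four times the batch size" floor concrete, here is a
simplified sketch of the arithmetic.  It assumes the per-CPU high is the
zone's managed pages divided by the fraction, spread evenly over the CPUs,
and then floored at 4 * batch; the real kernel calculation has more
detail, and pcp_high_sketch() is a hypothetical name.  The inputs are the
managed page count, CPU count and batch from the zoneinfo quoted above.

```python
def pcp_high_sketch(managed_pages, high_fraction, nr_cpus, batch):
    """Simplified per-CPU high watermark when a high fraction is set.

    Assumption: managed pages are divided by the fraction, split evenly
    across CPUs, and the result is never allowed below 4 * batch.
    """
    per_cpu = (managed_pages // high_fraction) // nr_cpus
    return max(per_cpu, 4 * batch)


# With the fraction set to 0x7fffffff the division yields 0 pages,
# so only the 4 * batch floor remains.
print(pcp_high_sketch(291974346, 0x7FFFFFFF, 256, 63))
```

Under these assumptions the high watermark drops from 6392 pages to 252,
which bounds how many pages a single drain can free.
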
>
> Backporting the series "mm: PCP high auto-tuning"
> -------------------------------------------------

Again, not upstream activity.  We can describe the upstream behavior
directly.

> My second endeavor was to backport the series titled
> "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> into our 6.1.y stable kernel version. Subsequent to its deployment in our
> production environment, I noted a pronounced reduction in latency. The
> observed outcomes are as enumerated below:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 2        |                                        |
>       2048 -> 4095       : 11       |                                        |
>       4096 -> 8191       : 3        |                                        |
>       8192 -> 16383      : 1        |                                        |
>      16384 -> 32767      : 2        |                                        |
>      32768 -> 65535      : 7        |                                        |
>      65536 -> 131071     : 198      |*********                               |
>     131072 -> 262143     : 530      |************************                |
>     262144 -> 524287     : 824      |**************************************  |
>     524288 -> 1048575    : 852      |****************************************|
>    1048576 -> 2097151    : 714      |*********************************       |
>    2097152 -> 4194303    : 389      |******************                      |
>    4194304 -> 8388607    : 143      |******                                  |
>    8388608 -> 16777215   : 29       |*                                       |
>   16777216 -> 33554431   : 1        |                                        |
>
> Compared to the previous data, the maximum latency has been reduced to
> less than 30ms.

People don't care too much about page freeing latency during process
exit.  Instead, they care more about the process exit time, that is,
throughput.  So, it's better to show the page allocation latency that
is affected by the simultaneous process exits.

> Adjusting the CONFIG_PCP_BATCH_SCALE_MAX
> ----------------------------------------
>
> Upon Ying's suggestion, adjusting the CONFIG_PCP_BATCH_SCALE_MAX can
> potentially reduce the PCP batch size without compromising the PCP high
> watermark size. This approach could mitigate latency spikes without
> adversely affecting throughput. Consequently, my third attempt focused on
> modifying this configuration.
>
> To facilitate easier adjustments, I replaced CONFIG_PCP_BATCH_SCALE_MAX
> with a new sysctl knob named vm.pcp_batch_scale_max. By fine-tuning
> vm.pcp_batch_scale_max from its default value of 5 down to 0, I achieved a
> further reduction in the maximum latency, which was lowered to less than
> 2ms:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 36       |                                        |
>       2048 -> 4095       : 5063     |*****                                   |
>       4096 -> 8191       : 31226    |********************************        |
>       8192 -> 16383      : 37606    |*************************************** |
>      16384 -> 32767      : 38359    |****************************************|
>      32768 -> 65535      : 30652    |*******************************         |
>      65536 -> 131071     : 18714    |*******************                     |
>     131072 -> 262143     : 7968     |********                                |
>     262144 -> 524287     : 1996     |**                                      |
>     524288 -> 1048575    : 302      |                                        |
>    1048576 -> 2097151    : 19       |                                        |
>
> After multiple trials, I observed no significant differences between
> each attempt.
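
As a back-of-the-envelope check on why the knob matters, the sketch below
assumes the number of pages freed in one bulk drain is capped at
batch << scale_max, which is how I read the help text of
CONFIG_PCP_BATCH_SCALE_MAX; it is not taken from this series, and
max_pages_per_drain() is a made-up name.  The batch of 63 comes from the
zoneinfo quoted earlier.

```python
def max_pages_per_drain(batch, scale_max):
    """Assumed upper bound on pages freed in one bulk drain."""
    return batch << scale_max


# Compare the default scale of 5 with the tuned value of 0.
for scale in (5, 0):
    print(scale, max_pages_per_drain(63, scale))
```

Under that assumption the cap falls from 2016 pages to 63 pages, a 32x
reduction in the work done per drain, which is in line with the drop in
maximum latency reported above.
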
>
> The Proposal
> ============
>
> This series encompasses two minor refinements to the PCP high watermark
> auto-tuning mechanism, along with the introduction of a new sysctl knob
> that serves as a more practical alternative to the previous configuration
> method.
>
> Future improvement to zone->lock
> ================================
>
> To ultimately mitigate the zone->lock contention issue, several suggestions
> have been proposed. One approach involves dividing large zones into multiple
> smaller zones, as suggested by Matthew[2], while another entails splitting
> the zone->lock using a mechanism similar to memory arenas and shifting away
> from relying solely on zone_id to identify the range of free lists a
> particular page belongs to[3]. However, implementing these solutions is
> likely to necessitate a more extended development effort.
>
> Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ [0]
> Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@xxxxxxxxx/ [1]
> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@xxxxxxxxxxxxxxxxxxxx/ [2]
> Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@xxxxxxxxxxxxxxxxxxx/ [3]
>
> Changes:
> - mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
>   https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@xxxxxxxxx/
>
> Yafang Shao (3):
>   mm/page_alloc: A minor fix to the calculation of pcp->free_count
>   mm/page_alloc: Avoid changing pcp->high decaying when adjusting
>     CONFIG_PCP_BATCH_SCALE_MAX
>   mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
>
>  Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
>  include/linux/sysctl.h                  |  1 +
>  kernel/sysctl.c                         |  2 +-
>  mm/Kconfig                              | 11 -------
>  mm/page_alloc.c                         | 38 ++++++++++++++++++-------
>  5 files changed, 45 insertions(+), 22 deletions(-)

--
Best Regards,
Huang, Ying