On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>
> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >>
> >> > Background
> >> > ==========
> >> >
> >> > In our containerized environment, we have a specific type of container
> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> > processes are organized as separate processes rather than threads due
> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> > multi-threaded setup. Upon the exit of these containers, other
> >> > containers hosted on the same machine experience significant latency
> >> > spikes.
> >> >
> >> > Investigation
> >> > =============
> >> >
> >> > My investigation using perf tracing revealed that the root cause of
> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> > the exiting processes. This concurrent access to the zone->lock
> >> > results in contention, which becomes a hotspot and negatively impacts
> >> > performance. The perf results clearly indicate this contention as a
> >> > primary contributor to the observed latency issues.
> >> >
> >> >   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >      - 76.97% exit_mmap
> >> >         - 58.58% unmap_vmas
> >> >            - 58.55% unmap_single_vma
> >> >               - unmap_page_range
> >> >                  - 58.32% zap_pte_range
> >> >                     - 42.88% tlb_flush_mmu
> >> >                        - 42.76% free_pages_and_swap_cache
> >> >                           - 41.22% release_pages
> >> >                              - 33.29% free_unref_page_list
> >> >                                 - 32.37% free_unref_page_commit
> >> >                                    - 31.64% free_pcppages_bulk
> >> >                                       + 28.65% _raw_spin_lock
> >> >                                      1.28% __list_del_entry_valid
> >> >                              + 3.25% folio_lruvec_lock_irqsave
> >> >                              + 0.75% __mem_cgroup_uncharge_list
> >> >                                0.60% __mod_lruvec_state
> >> >                             1.07% free_swap_cache
> >> >                     + 11.69% page_remove_rmap
> >> >                       0.64% __mod_lruvec_page_state
> >> >         - 17.34% remove_vma
> >> >            - 17.25% vm_area_free
> >> >               - 17.23% kmem_cache_free
> >> >                  - 17.15% __slab_free
> >> >                     - 14.56% discard_slab
> >> >                          free_slab
> >> >                          __free_slab
> >> >                          __free_pages
> >> >                        - free_unref_page
> >> >                           - 13.50% free_unref_page_commit
> >> >                              - free_pcppages_bulk
> >> >                                 + 13.44% _raw_spin_lock
> >>
> >> I don't think your change will reduce zone->lock contention cycles. So,
> >> I don't find the value of the above data.
> >>
> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> > with the majority of them being regular order-0 user pages.
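
For reference, a trace like the one quoted below can be collected via
tracefs by enabling the kmem:mm_page_pcpu_drain tracepoint together with a
stacktrace trigger. The commands here are only an illustrative sketch (the
tracefs mount point may differ on your system):

  cd /sys/kernel/tracing                                     # or /sys/kernel/debug/tracing
  echo stacktrace > events/kmem/mm_page_pcpu_drain/trigger   # dump a call stack on every drain event
  echo 1 > events/kmem/mm_page_pcpu_drain/enable             # record the event line itself (page/pfn/order)
  cat trace_pipe
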
> >> >
> >> > <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> > <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >  => free_pcppages_bulk
> >> >  => free_unref_page_commit
> >> >  => free_unref_page_list
> >> >  => release_pages
> >> >  => free_pages_and_swap_cache
> >> >  => tlb_flush_mmu
> >> >  => zap_pte_range
> >> >  => unmap_page_range
> >> >  => unmap_single_vma
> >> >  => unmap_vmas
> >> >  => exit_mmap
> >> >  => mmput
> >> >  => do_exit
> >> >  => do_group_exit
> >> >  => get_signal
> >> >  => arch_do_signal_or_restart
> >> >  => exit_to_user_mode_prepare
> >> >  => syscall_exit_to_user_mode
> >> >  => do_syscall_64
> >> >  => entry_SYSCALL_64_after_hwframe
> >> >
> >> > The servers experiencing these issues are equipped with impressive
> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >
> >> > Node 0, zone   Normal
> >> >   pages free     144465775
> >> >         boost    0
> >> >         min      1309270
> >> >         low      1636587
> >> >         high     1963904
> >> >         spanned  564133888
> >> >         present  296747008
> >> >         managed  291974346
> >> >         cma      0
> >> >         protection: (0, 0, 0, 0)
> >> > ...
> >> >   pagesets
> >> >     cpu: 0
> >> >               count: 2217
> >> >               high:  6392
> >> >               batch: 63
> >> >   vm stats threshold: 125
> >> >     cpu: 1
> >> >               count: 4510
> >> >               high:  6392
> >> >               batch: 63
> >> >   vm stats threshold: 125
> >> >     cpu: 2
> >> >               count: 3059
> >> >               high:  6392
> >> >               batch: 63
> >> >
> >> > ...
> >> >
> >> > The pcp high is around 100 times the batch size.
> >> >
> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> > function during the container exit process:
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 0        |                                        |
> >> >        256 -> 511        : 148      |*****************                       |
> >> >        512 -> 1023       : 334      |****************************************|
> >> >       1024 -> 2047       : 33       |***                                     |
> >> >       2048 -> 4095       : 5        |                                        |
> >> >       4096 -> 8191       : 7        |                                        |
> >> >       8192 -> 16383      : 12       |*                                       |
> >> >      16384 -> 32767      : 30       |***                                     |
> >> >      32768 -> 65535      : 21       |**                                      |
> >> >      65536 -> 131071     : 15       |*                                       |
> >> >     131072 -> 262143     : 27       |***                                     |
> >> >     262144 -> 524287     : 84       |**********                              |
> >> >     524288 -> 1048575    : 203      |************************                |
> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >
> >> > The latency can reach tens of milliseconds.
> >> >
> >> > Experimenting
> >> > =============
> >> >
> >> > vm.percpu_pagelist_high_fraction
> >> > --------------------------------
> >> >
> >> > The kernel version currently deployed in our production environment is the
> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >>
> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> description. And I don't think that it's necessary to describe the
> >> alternative solution with too much details.
> >>
> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> > page draining, which subsequently leads to a substantial reduction in
> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> > improvement in latency.
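
As a side note, 0x7fffffff is INT_MAX (2147483647 in decimal); applying the
setting looks like, for example:

  # Push pcp->high down toward its floor (four times the batch size, as
  # noted below), so that each pcp drain holds zone->lock for fewer pages.
  sysctl -w vm.percpu_pagelist_high_fraction=2147483647
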
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 120      |                                        |
> >> >        256 -> 511        : 365      |*                                       |
> >> >        512 -> 1023       : 201      |                                        |
> >> >       1024 -> 2047       : 103      |                                        |
> >> >       2048 -> 4095       : 84       |                                        |
> >> >       4096 -> 8191       : 87       |                                        |
> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >
> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> > have yet to observe any significant difference in throughput within our
> >> > production environment after implementing this change.
> >> >
> >> > Backporting the series "mm: PCP high auto-tuning"
> >> > -------------------------------------------------
> >>
> >> Again, not upstream activity. We can describe the upstream behavior
> >> directly.
> >
> > Andrew has requested that I provide a more comprehensive analysis of
> > this issue, and in response, I have endeavored to outline all the
> > pertinent details in a thorough and detailed manner.
>
> IMHO, upstream activity can provide comprehensive analysis of the issue
> too. And, your patch has changed much from the first version. It's
> better to describe your current version.

After backporting the PCP auto-tuning feature to our 6.1.y branch, the PCP
code is almost identical to that of the upstream kernel. I have documented
in detail the results observed with the backported version, which should
give a clear picture of the effect. Note, however, that I am unable to run
the upstream kernel directly in our production environment due to practical
constraints.

>
> >>
> >> > My second endeavor was to backport the series titled
> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> > production environment, I noted a pronounced reduction in latency.
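
As an aside on methodology: the latency histograms in this thread are in
the BCC log2-histogram format, so they can be reproduced with something
along the lines of BCC's funclatency tool (the 60-second duration here is
just an arbitrary example):

  # Attach to free_pcppages_bulk() and print a latency histogram in
  # nanoseconds (funclatency's default unit) after 60 seconds.
  /usr/share/bcc/tools/funclatency -d 60 free_pcppages_bulk
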
> >> > The observed outcomes are as enumerated below:
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 0        |                                        |
> >> >        256 -> 511        : 0        |                                        |
> >> >        512 -> 1023       : 0        |                                        |
> >> >       1024 -> 2047       : 2        |                                        |
> >> >       2048 -> 4095       : 11       |                                        |
> >> >       4096 -> 8191       : 3        |                                        |
> >> >       8192 -> 16383      : 1        |                                        |
> >> >      16384 -> 32767      : 2        |                                        |
> >> >      32768 -> 65535      : 7        |                                        |
> >> >      65536 -> 131071     : 198      |*********                               |
> >> >     131072 -> 262143     : 530      |************************                |
> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >
> >> > Compared to the previous data, the maximum latency has been reduced to
> >> > less than 30ms.
> >>
> >> People don't care too much about page freeing latency during processes
> >> exiting. Instead, they care more about the process exiting time, that
> >> is, throughput. So, it's better to show the page allocation latency
> >> which is affected by the simultaneous processes exiting.
> >
> > I'm confused also. Is this issue really hard to understand?
>
> IMHO, it's better to prove the issue directly. If you cannot prove it
> directly, you can try alternative one and describe why.

Not all data can be verified directly or easily. The primary focus is the
zone->lock contention, so what needs to be measured is the latency that
this contention incurs. free_pcppages_bulk() is an effective place to
observe that, which is why I chose to measure the latency of
free_pcppages_bulk() specifically.

I did not measure page allocation latency because that would require
finding a participant willing to endure the potential delays, and no one
was willing to volunteer. Measuring the latency of free_pcppages_bulk(),
by contrast, only requires identifying and experimenting with the source
of the delays, which makes it a far more practical approach.

--
Regards
Yafang