On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>
> > Background
> > ==========
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads due
> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> > multi-threaded setup. Upon the exit of these containers, other
> > containers hosted on the same machine experience significant latency
> > spikes.
> >
> > Investigation
> > =============
> >
> > My investigation using perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by each of
> > the exiting processes. This concurrent access to the zone->lock
> > results in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as a
> > primary contributor to the observed latency issues.
> >
> >   +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >   -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >      - 76.97% exit_mmap
> >         - 58.58% unmap_vmas
> >            - 58.55% unmap_single_vma
> >               - unmap_page_range
> >                  - 58.32% zap_pte_range
> >                     - 42.88% tlb_flush_mmu
> >                        - 42.76% free_pages_and_swap_cache
> >                           - 41.22% release_pages
> >                              - 33.29% free_unref_page_list
> >                                 - 32.37% free_unref_page_commit
> >                                    - 31.64% free_pcppages_bulk
> >                                       + 28.65% _raw_spin_lock
> >                                         1.28% __list_del_entry_valid
> >                              + 3.25% folio_lruvec_lock_irqsave
> >                              + 0.75% __mem_cgroup_uncharge_list
> >                                0.60% __mod_lruvec_state
> >                             1.07% free_swap_cache
> >                     + 11.69% page_remove_rmap
> >                       0.64% __mod_lruvec_page_state
> >         - 17.34% remove_vma
> >            - 17.25% vm_area_free
> >               - 17.23% kmem_cache_free
> >                  - 17.15% __slab_free
> >                     - 14.56% discard_slab
> >                          free_slab
> >                          __free_slab
> >                          __free_pages
> >                        - free_unref_page
> >                           - 13.50% free_unref_page_commit
> >                              - free_pcppages_bulk
> >                                 + 13.44% _raw_spin_lock
>
> I don't think your change will reduce zone->lock contention cycles.  So,
> I don't find the value of the above data.
>
> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> > with the majority of them being regular order-0 user pages.
> >
> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >   <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >  => free_pcppages_bulk
> >  => free_unref_page_commit
> >  => free_unref_page_list
> >  => release_pages
> >  => free_pages_and_swap_cache
> >  => tlb_flush_mmu
> >  => zap_pte_range
> >  => unmap_page_range
> >  => unmap_single_vma
> >  => unmap_vmas
> >  => exit_mmap
> >  => mmput
> >  => do_exit
> >  => do_group_exit
> >  => get_signal
> >  => arch_do_signal_or_restart
> >  => exit_to_user_mode_prepare
> >  => syscall_exit_to_user_mode
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues are equipped with impressive
> > hardware specifications, including 256 CPUs and 1TB of memory, all
> > within a single NUMA node. The zoneinfo is as follows,
> >
> > Node 0, zone   Normal
> >   pages free     144465775
> >         boost    0
> >         min      1309270
> >         low      1636587
> >         high     1963904
> >         spanned  564133888
> >         present  296747008
> >         managed  291974346
> >         cma      0
> >         protection: (0, 0, 0, 0)
> > ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >
> > ...
> >
> > The pcp high is around 100 times the batch size.
> >
> > I also traced the latency associated with the free_pcppages_bulk()
> > function during the container exit process:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 148      |*****************                       |
> >        512 -> 1023       : 334      |****************************************|
> >       1024 -> 2047       : 33       |***                                     |
> >       2048 -> 4095       : 5        |                                        |
> >       4096 -> 8191       : 7        |                                        |
> >       8192 -> 16383      : 12       |*                                       |
> >      16384 -> 32767      : 30       |***                                     |
> >      32768 -> 65535      : 21       |**                                      |
> >      65536 -> 131071     : 15       |*                                       |
> >     131072 -> 262143     : 27       |***                                     |
> >     262144 -> 524287     : 84       |**********                              |
> >     524288 -> 1048575    : 203      |************************                |
> >    1048576 -> 2097151    : 284      |**********************************      |
> >    2097152 -> 4194303    : 327      |*************************************** |
> >    4194304 -> 8388607    : 215      |*************************               |
> >    8388608 -> 16777215   : 116      |*************                           |
> >   16777216 -> 33554431   : 47       |*****                                   |
> >   33554432 -> 67108863   : 8        |                                        |
> >   67108864 -> 134217727  : 3        |                                        |
> >
> > The latency can reach tens of milliseconds.
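
(Aside: the histograms in this thread are bcc-style log2 histograms. For
anyone who wants to reproduce the measurement, a minimal sketch along the
lines below should work. It is only an illustration of the approach, not
the exact tool that produced the numbers above, and it assumes
free_pcppages_bulk() is not inlined and can be kprobed on the running
kernel.)

#!/usr/bin/env python3
# Minimal bcc sketch: free_pcppages_bulk() latency as a log2 histogram.
# Illustrative only; not the exact tool used to collect the data above.
from time import sleep
from bcc import BPF

bpf_text = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="free_pcppages_bulk", fn_name="trace_entry")
b.attach_kretprobe(event="free_pcppages_bulk", fn_name="trace_return")

print("Tracing free_pcppages_bulk()... hit Ctrl-C to print the histogram")
try:
    sleep(999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("nsecs")

Keying the start timestamp by thread id keeps the many concurrently
exiting tasks from overwriting each other's entries.
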
> > Experimenting
> > =============
> >
> > vm.percpu_pagelist_high_fraction
> > --------------------------------
> >
> > The kernel version currently deployed in our production environment is the
> > stable 6.1.y, and my initial strategy involves optimizing the
>
> IMHO, we should focus on upstream activity in the cover letter and patch
> description.  And I don't think that it's necessary to describe the
> alternative solution with too much details.
>
> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> > page draining, which subsequently leads to a substantial reduction in
> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> > improvement in latency.
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 120      |                                        |
> >        256 -> 511        : 365      |*                                       |
> >        512 -> 1023       : 201      |                                        |
> >       1024 -> 2047       : 103      |                                        |
> >       2048 -> 4095       : 84       |                                        |
> >       4096 -> 8191       : 87       |                                        |
> >       8192 -> 16383      : 4777     |**************                          |
> >      16384 -> 32767      : 10572    |*******************************         |
> >      32768 -> 65535      : 13544    |****************************************|
> >      65536 -> 131071     : 12723    |*************************************   |
> >     131072 -> 262143     : 8604     |*************************               |
> >     262144 -> 524287     : 3659     |**********                              |
> >     524288 -> 1048575    : 921      |**                                      |
> >    1048576 -> 2097151    : 122      |                                        |
> >    2097152 -> 4194303    : 5        |                                        |
> >
> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> > pcp high watermark size to a minimum of four times the batch size. While
> > this could theoretically affect throughput, as highlighted by Ying[0], we
> > have yet to observe any significant difference in throughput within our
> > production environment after implementing this change.
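
(Aside on why the high limit bottoms out at four times the batch: my
reading of the 6.1-era zone_highsize() logic is roughly the arithmetic
below. Treat it as a back-of-the-envelope sketch with the constants taken
from the zoneinfo quoted earlier, not as a copy of the kernel code.)

#!/usr/bin/env python3
# Back-of-the-envelope sketch of how percpu_pagelist_high_fraction shapes
# the per-CPU pcp "high" limit (an approximation of zone_highsize() in
# 6.1-era kernels, not kernel code).

MANAGED_PAGES = 291974346   # "managed" from the zoneinfo quoted above
LOW_WMARK     = 1636587     # "low" watermark from the zoneinfo quoted above
LOCAL_CPUS    = 256         # CPUs local to the single NUMA node
BATCH         = 63          # per-CPU batch from the pagesets quoted above

def pcp_high(fraction):
    # fraction == 0 (the default) bases high on the zone low watermark;
    # otherwise high is a fraction of the zone's managed pages.  Either
    # way it is split across the CPUs local to the zone and clamped to
    # at least 4 * batch.
    total = LOW_WMARK if fraction == 0 else MANAGED_PAGES // fraction
    return max(total // LOCAL_CPUS, 4 * BATCH)

print(pcp_high(0))           # default: 6392 pages (~25 MiB) per CPU
print(pcp_high(0x7fffffff))  # tuned:   252 pages, i.e. 4 * batch

With the fraction set to 0x7fffffff the managed-pages term rounds down to
zero, so the 4 * batch clamp (252 pages here) is what actually takes
effect, and each free_pcppages_bulk() call has far less work to do under
zone->lock.
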
> > Backporting the series "mm: PCP high auto-tuning"
> > -------------------------------------------------
>
> Again, not upstream activity.  We can describe the upstream behavior
> directly.

Andrew asked me to provide a more comprehensive analysis of this issue, so
I have tried to lay out all of the relevant details here.

> > My second endeavor was to backport the series titled
> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> > production environment, I noted a pronounced reduction in latency. The
> > observed outcomes are as enumerated below:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 11       |                                        |
> >       4096 -> 8191       : 3        |                                        |
> >       8192 -> 16383      : 1        |                                        |
> >      16384 -> 32767      : 2        |                                        |
> >      32768 -> 65535      : 7        |                                        |
> >      65536 -> 131071     : 198      |*********                               |
> >     131072 -> 262143     : 530      |************************                |
> >     262144 -> 524287     : 824      |**************************************  |
> >     524288 -> 1048575    : 852      |****************************************|
> >    1048576 -> 2097151    : 714      |*********************************       |
> >    2097152 -> 4194303    : 389      |******************                      |
> >    4194304 -> 8388607    : 143      |******                                  |
> >    8388608 -> 16777215   : 29       |*                                       |
> >   16777216 -> 33554431   : 1        |                                        |
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
>
> People don't care too much about page freeing latency during processes
> exiting.  Instead, they care more about the process exiting time, that
> is, throughput.  So, it's better to show the page allocation latency
> which is affected by the simultaneous processes exiting.

I'm also confused. Is this issue really that hard to understand?

--
Regards
Yafang