Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>>
>> > Currently, we're encountering latency spikes in our container environment
>> > when a specific container with multiple Python-based tasks exits. These
>> > tasks may hold the zone->lock for an extended period, significantly
>> > impacting latency for other containers attempting to allocate memory.
>>
>> Is this locking issue well understood? Is anyone working on it? A
>> reasonably detailed description of the issue and a description of any
>> ongoing work would be helpful here.
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> tasks are organized as separate processes rather than threads because
> the Python Global Interpreter Lock (GIL) is a bottleneck in a
> multi-threaded setup. When such a container exits, other containers
> hosted on the same machine experience significant latency spikes.
>
> Our investigation with perf revealed that the root cause of these
> spikes is the simultaneous execution of exit_mmap() by each of the
> exiting processes. The resulting contention on zone->lock becomes a
> hotspot and degrades performance for everything else on the machine.
> The perf results clearly show this contention as a primary contributor
> to the observed latency:
>
>   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>      - 76.97% exit_mmap
>         - 58.58% unmap_vmas
>            - 58.55% unmap_single_vma
>               - unmap_page_range
>                  - 58.32% zap_pte_range
>                     - 42.88% tlb_flush_mmu
>                        - 42.76% free_pages_and_swap_cache
>                           - 41.22% release_pages
>                              - 33.29% free_unref_page_list
>                                 - 32.37% free_unref_page_commit
>                                    - 31.64% free_pcppages_bulk
>                                       + 28.65% _raw_spin_lock
>                                         1.28% __list_del_entry_valid
>                              + 3.25% folio_lruvec_lock_irqsave
>                              + 0.75% __mem_cgroup_uncharge_list
>                                0.60% __mod_lruvec_state
>                             1.07% free_swap_cache
>                     + 11.69% page_remove_rmap
>                       0.64% __mod_lruvec_page_state
>         - 17.34% remove_vma
>            - 17.25% vm_area_free
>               - 17.23% kmem_cache_free
>                  - 17.15% __slab_free
>                     - 14.56% discard_slab
>                          free_slab
>                          __free_slab
>                          __free_pages
>                        - free_unref_page
>                           - 13.50% free_unref_page_commit
>                              - free_pcppages_bulk
>                                 + 13.44% _raw_spin_lock
>
> By enabling the mm_page_pcpu_drain tracepoint we can capture the
> detailed stack:
>
>   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>   <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>    => free_pcppages_bulk
>    => free_unref_page_commit
>    => free_unref_page_list
>    => release_pages
>    => free_pages_and_swap_cache
>    => tlb_flush_mmu
>    => zap_pte_range
>    => unmap_page_range
>    => unmap_single_vma
>    => unmap_vmas
>    => exit_mmap
>    => mmput
>    => do_exit
>    => do_group_exit
>    => get_signal
>    => arch_do_signal_or_restart
>    => exit_to_user_mode_prepare
>    => syscall_exit_to_user_mode
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
>
> The servers experiencing these issues have 256 CPUs and 1TB of memory,
> all within a single NUMA node.
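For context on the trace above: pages freed in zap_pte_range() are first
parked on the per-CPU pagelist (PCP) and are only handed back to the
buddy allocator in bulk, under zone->lock, once the list grows past its
"high" mark. The toy userspace model below is only a sketch of that
threshold behaviour: the high/batch constants come from the zoneinfo
further down, the drain policy is deliberately the simplest possible
one, and the real free_unref_page()/free_pcppages_bulk() logic in
mm/page_alloc.c scales the drain size and differs between kernel
versions.

/*
 * pcp_free_model.c - toy userspace model of the per-CPU pagelist (PCP)
 * free path shown in the perf trace above.  NOT kernel code; the real
 * drain policy is more elaborate and version-dependent.
 */
#include <stdio.h>

#define PCP_HIGH	6392	/* "high" from the zoneinfo below */
#define PCP_BATCH	63	/* "batch" from the zoneinfo below */

static long pcp_count;		/* pages sitting on this CPU's list */
static long lock_acquisitions;	/* times we took the shared zone->lock */
static long pages_under_lock;	/* total pages walked while holding it */

/* Stand-in for free_pcppages_bulk(): hand 'nr' pages back to the buddy
 * allocator while holding zone->lock.  The length of this list walk is
 * what other CPUs see as lock hold time. */
static void drain_to_buddy(long nr)
{
	lock_acquisitions++;
	pages_under_lock += nr;
	pcp_count -= nr;
}

/* Stand-in for free_unref_page(): park the page on the CPU-local list;
 * zone->lock is only touched once the list crosses the high mark. */
static void free_one_page(void)
{
	if (++pcp_count >= PCP_HIGH)
		drain_to_buddy(PCP_BATCH);	/* simplest possible policy */
}

int main(void)
{
	const long pages = 6L * 1024 * 256;	/* one 6GB process, 4KiB pages */

	for (long i = 0; i < pages; i++)
		free_one_page();

	printf("%ld pages freed: %ld zone->lock acquisitions, %ld pages handed back under the lock\n",
	       pages, lock_acquisitions, pages_under_lock);
	return 0;
}

Under this deliberately simplified policy, one exiting 6GB process frees
about 1.57M pages and takes zone->lock roughly 25,000 times; with 18
processes exiting concurrently on a single zone shared by 256 CPUs, each
of those acquisitions contends with the other exiting CPUs and with
allocations from the other containers.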
> The zoneinfo is as follows:
>
>   Node 0, zone   Normal
>     pages free     144465775
>           boost    0
>           min      1309270
>           low      1636587
>           high     1963904
>           spanned  564133888
>           present  296747008
>           managed  291974346
>           cma      0
>           protection: (0, 0, 0, 0)
>     ...
>     pagesets
>       cpu: 0
>                 count: 2217
>                 high:  6392
>                 batch: 63
>     vm stats threshold: 125
>       cpu: 1
>                 count: 4510
>                 high:  6392
>                 batch: 63
>     vm stats threshold: 125
>       cpu: 2
>                 count: 3059
>                 high:  6392
>                 batch: 63
>     ...
>
> The pcp high is around 100 times the batch size.
>
> We also traced the latency of free_pcppages_bulk() while the container
> was exiting:
>
> 19:48:54
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |*************************************** |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>
> The latency can reach tens of milliseconds.
>
> By adjusting vm.percpu_pagelist_high_fraction so that the pagelist high
> drops to its minimum of 4 times the batch size, we significantly
> reduced the latency of free_pcppages_bulk() during container exits:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>
> After tuning vm.percpu_pagelist_high_fraction to that minimum pagelist
> high, the other containers stopped reporting latency complaints. We
> have therefore adopted this tuning as a permanent workaround and
> deployed it across all clusters of servers where these containers may
> run.

Thanks for your detailed data.

IIUC, the latency of free_pcppages_bulk() during process exit itself
shouldn't be a problem, because for the exiting processes users care
more about the total exit time, that is, throughput. In fact, I suspect
that zone->lock contention and page allocation/freeing throughput will
be worse with your configuration. What matters is the latency of
free_pcppages_bulk() and of page allocation in the other processes, and
your configuration does help there.

Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX. That way you
keep a normal PCP size (high) but use a smaller PCP batch. I guess that
may help both latency and throughput in your system. Could you give it
a try?
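To make the comparison concrete with the numbers from your zoneinfo
(high = 6392, batch = 63), here is a rough back-of-the-envelope model.
It is only a sketch of my understanding: I am assuming the worst-case
number of pages freed per zone->lock hold is roughly
min(high - batch, batch << CONFIG_PCP_BATCH_SCALE_MAX), but the exact
policy in mm/page_alloc.c depends on the kernel version, and everything
besides the 6392/63 figures (the struct, the chosen scale values) is
made up for illustration.

/*
 * Toy arithmetic comparing the two tunings for one exiting 6GB process
 * freeing 4KiB pages.  NOT kernel code.
 */
#include <stdio.h>

enum { BATCH = 63 };	/* pcp batch from the zoneinfo above */

struct cfg {
	const char *name;
	long high;		/* pcp->high */
	int batch_scale_max;	/* CONFIG_PCP_BATCH_SCALE_MAX */
};

int main(void)
{
	const long pages = 6L * 1024 * 256;	/* 6GB / 4KiB */
	const struct cfg cfgs[] = {
		{ "default (high=6392, scale=5)",            6392,      5 },
		{ "percpu_pagelist_high_fraction, 4*batch",  4 * BATCH, 5 },
		{ "CONFIG_PCP_BATCH_SCALE_MAX=2, high kept", 6392,      2 },
	};

	for (unsigned int i = 0; i < sizeof(cfgs) / sizeof(cfgs[0]); i++) {
		const struct cfg *c = &cfgs[i];
		/* Assumed worst-case list walk per zone->lock hold:
		 * roughly min(high - batch, batch << scale_max). */
		long drain = c->high - BATCH;
		long cap = (long)BATCH << c->batch_scale_max;

		if (drain > cap)
			drain = cap;

		printf("%-42s pcp high: %4ld pages, worst drain per lock hold: %4ld, min lock holds per exit: %ld\n",
		       c->name, c->high, drain, pages / drain);
	}
	return 0;
}

Under these assumptions, both tunings bound the worst-case walk under
zone->lock to a couple of hundred pages, but lowering only the batch
scale keeps the full 6392-page per-CPU cache, so allocations and frees
can still be satisfied locally most of the time. That is why I expect it
to behave better for throughput than shrinking high itself; measurements
from your workload would show whether that holds in practice.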
[snip]

--
Best Regards,
Huang, Ying