Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>>
>> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> >>
>> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> >> >>
>> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>> >> >>
>> >> >> > Currently, we're encountering latency spikes in our container environment
>> >> >> > when a specific container with multiple Python-based tasks exits. These
>> >> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> >> > impacting latency for other containers attempting to allocate memory.
>> >> >>
>> >> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> >> reasonably detailed description of the issue and a description of any
>> >> >> ongoing work would be helpful here.
>> >> >
>> >> > In our containerized environment, we have a specific type of container
>> >> > that runs 18 processes, each consuming approximately 6GB of RSS. The
>> >> > workload is run as separate processes rather than threads because the
>> >> > Python Global Interpreter Lock (GIL) would be a bottleneck in a
>> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> > containers hosted on the same machine experience significant latency
>> >> > spikes.
>> >> >
>> >> > Our investigation using perf tracing revealed that the root cause of
>> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> > the exiting processes. This concurrent access to the zone->lock
>> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> > performance. The perf results clearly indicate this contention as a
>> >> > primary contributor to the observed latency issues.
>> >> >
>> >> >   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >      - 76.97% exit_mmap
>> >> >         - 58.58% unmap_vmas
>> >> >            - 58.55% unmap_single_vma
>> >> >               - unmap_page_range
>> >> >                  - 58.32% zap_pte_range
>> >> >                     - 42.88% tlb_flush_mmu
>> >> >                        - 42.76% free_pages_and_swap_cache
>> >> >                           - 41.22% release_pages
>> >> >                              - 33.29% free_unref_page_list
>> >> >                                 - 32.37% free_unref_page_commit
>> >> >                                    - 31.64% free_pcppages_bulk
>> >> >                                       + 28.65% _raw_spin_lock
>> >> >                                         1.28% __list_del_entry_valid
>> >> >                              + 3.25% folio_lruvec_lock_irqsave
>> >> >                              + 0.75% __mem_cgroup_uncharge_list
>> >> >                                0.60% __mod_lruvec_state
>> >> >                             1.07% free_swap_cache
>> >> >                     + 11.69% page_remove_rmap
>> >> >                       0.64% __mod_lruvec_page_state
>> >> >         - 17.34% remove_vma
>> >> >            - 17.25% vm_area_free
>> >> >               - 17.23% kmem_cache_free
>> >> >                  - 17.15% __slab_free
>> >> >                     - 14.56% discard_slab
>> >> >                          free_slab
>> >> >                          __free_slab
>> >> >                          __free_pages
>> >> >                        - free_unref_page
>> >> >                           - 13.50% free_unref_page_commit
>> >> >                              - free_pcppages_bulk
>> >> >                                 + 13.44% _raw_spin_lock
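>> >> >
>> >> > To put the amount of work behind these numbers in perspective (a rough
>> >> > back-of-the-envelope estimate, assuming 4KB base pages): 6GB of RSS is
>> >> > about 1.5 million order-0 pages per process, so the 18 exiting
>> >> > processes push roughly 28 million pages back to the buddy allocator
>> >> > through the same zone->lock, while other containers are trying to
>> >> > allocate from that zone at the same time.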
>> >> >
>> >> > By enabling the mm_page_pcpu_drain tracepoint we can see the detailed
>> >> > call stack:
>> >> >
>> >> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >   <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >  => free_pcppages_bulk
>> >> >  => free_unref_page_commit
>> >> >  => free_unref_page_list
>> >> >  => release_pages
>> >> >  => free_pages_and_swap_cache
>> >> >  => tlb_flush_mmu
>> >> >  => zap_pte_range
>> >> >  => unmap_page_range
>> >> >  => unmap_single_vma
>> >> >  => unmap_vmas
>> >> >  => exit_mmap
>> >> >  => mmput
>> >> >  => do_exit
>> >> >  => do_group_exit
>> >> >  => get_signal
>> >> >  => arch_do_signal_or_restart
>> >> >  => exit_to_user_mode_prepare
>> >> >  => syscall_exit_to_user_mode
>> >> >  => do_syscall_64
>> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >
>> >> > The servers experiencing these issues are generously provisioned, with
>> >> > 256 CPUs and 1TB of memory, all within a single NUMA node. The
>> >> > zoneinfo is as follows:
>> >> >
>> >> >   Node 0, zone   Normal
>> >> >     pages free     144465775
>> >> >           boost    0
>> >> >           min      1309270
>> >> >           low      1636587
>> >> >           high     1963904
>> >> >           spanned  564133888
>> >> >           present  296747008
>> >> >           managed  291974346
>> >> >           cma      0
>> >> >           protection: (0, 0, 0, 0)
>> >> >   ...
>> >> >   ...
>> >> >     pagesets
>> >> >       cpu: 0
>> >> >                 count: 2217
>> >> >                 high:  6392
>> >> >                 batch: 63
>> >> >         vm stats threshold: 125
>> >> >       cpu: 1
>> >> >                 count: 4510
>> >> >                 high:  6392
>> >> >                 batch: 63
>> >> >         vm stats threshold: 125
>> >> >       cpu: 2
>> >> >                 count: 3059
>> >> >                 high:  6392
>> >> >                 batch: 63
>> >> >
>> >> >   ...
>> >> >
>> >> > The high is around 100 times the batch size.
>> >> >
>> >> > We also traced the latency of the free_pcppages_bulk() function during
>> >> > the container exit:
>> >> >
>> >> > 19:48:54
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 148      |*****************                       |
>> >> >        512 -> 1023       : 334      |****************************************|
>> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >       2048 -> 4095       : 5        |                                        |
>> >> >       4096 -> 8191       : 7        |                                        |
>> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >    2097152 -> 4194303    : 327      |***************************************ated|
>> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >
>> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >
>> >> > The latency can reach tens of milliseconds.
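>> >> >
>> >> > The pageset values above make the worst case for a single flush quite
>> >> > large: as far as I can tell, nr_pcp_free() leaves at least 'batch'
>> >> > pages on the per-CPU list and may hand back up to (high - batch) pages
>> >> > to the buddy allocator under a single zone->lock hold. Below is a
>> >> > small user-space model of that upper bound (a simplified sketch; the
>> >> > real kernel logic also factors in pcp->free_count and free_high, so
>> >> > treat these numbers as a rough ceiling only):
>> >> >
>> >> >   #include <stdio.h>
>> >> >
>> >> >   #define PAGE_SIZE 4096UL   /* assuming 4KB base pages */
>> >> >
>> >> >   /* Upper bound on pages freed per zone->lock hold: high - batch. */
>> >> >   static void worst_case(const char *desc, unsigned long high,
>> >> >                          unsigned long batch)
>> >> >   {
>> >> >           unsigned long max_nr_free = high - batch;
>> >> >
>> >> >           printf("%-8s high=%-5lu batch=%lu -> up to %4lu pages (%5.1f MB) per flush\n",
>> >> >                  desc, high, batch, max_nr_free,
>> >> >                  max_nr_free * PAGE_SIZE / (1024.0 * 1024.0));
>> >> >   }
>> >> >
>> >> >   int main(void)
>> >> >   {
>> >> >           worst_case("default:", 6392, 63);  /* high from the zoneinfo above */
>> >> >           worst_case("tuned:", 4 * 63, 63);  /* high clamped to 4 * batch    */
>> >> >           return 0;
>> >> >   }
>> >> >
>> >> > With the default high of 6392 that is up to 6329 pages (about 24.7MB)
>> >> > freed back to the buddy allocator in one critical section, versus at
>> >> > most 189 pages (about 0.7MB) once high is clamped down to 4 times the
>> >> > batch, which is exactly the adjustment described next.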
>> >> >
>> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> >> > minimum pagelist high at 4 times the batch size, we were able to
>> >> > significantly reduce the latency associated with the
>> >> > free_pcppages_bulk() function during container exits:
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 120      |                                        |
>> >> >        256 -> 511        : 365      |*                                       |
>> >> >        512 -> 1023       : 201      |                                        |
>> >> >       1024 -> 2047       : 103      |                                        |
>> >> >       2048 -> 4095       : 84       |                                        |
>> >> >       4096 -> 8191       : 87       |                                        |
>> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >
>> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >
>> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> >> > knob to set the minimum pagelist high at a level that effectively
>> >> > mitigated the latency issues, we observed that other containers were
>> >> > no longer reporting similar complaints. As a result, we decided to
>> >> > adopt this tuning as a permanent workaround and have deployed it
>> >> > across all clusters of servers where these containers may be deployed.
>> >>
>> >> Thanks for your detailed data.
>> >>
>> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> shouldn't be a problem?
>> >
>> > Right. The problem arises when the process holds the lock for too
>> > long, causing other processes that are attempting to allocate memory
>> > to experience delays.
>> >
>> >> Because users care more about the total time of
>> >> process exiting, that is, throughput.  And I suspect that the zone->lock
>> >> contention and page allocating/freeing throughput will be worse with
>> >> your configuration?
>> >
>> > While reducing throughput may not be a significant concern, given the
>> > minimal difference, the potential for latency spikes, a crucial metric
>> > for assessing system stability, is of greater concern to users. Higher
>> > latency can lead to request errors, impacting the user experience.
>> > Therefore, maintaining stability, even at the cost of slightly lower
>> > throughput, is preferable to experiencing higher throughput with
>> > unstable performance.
>> >
>> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> processes is a problem.  And your configuration can help it.
>> >>
>> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> >> may help both latency and throughput in your system.  Could you give it
>> >> a try?
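>> >>
>> >> Roughly speaking, CONFIG_PCP_BATCH_SCALE_MAX caps the number of pages
>> >> passed to free_pcppages_bulk() at about batch << CONFIG_PCP_BATCH_SCALE_MAX
>> >> (the exact behavior depends on the kernel version).  With your batch of
>> >> 63, that is 2016 pages per zone->lock hold at the upstream default of 5,
>> >> and only 63 pages at 0, compared with up to high - batch = 6329 pages
>> >> with your current high of 6392.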
>> >
>> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> > configuration option. However, I've observed your recent improvements
>> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> > restrict the pcp batch scale factor to avoid too long latency"), which
>> > prompted me to experiment with manually setting the pcp->free_factor
>> > to zero. While this adjustment provided some improvement, the results
>> > were not as significant as I had hoped.
>> >
>> > BTW, perhaps we should consider implementing a sysctl knob as an
>> > alternative to CONFIG_PCP_BATCH_SCALE_MAX? That would allow users to
>> > adjust it more easily.
>>
>> If you cannot test upstream behavior, it's hard to make changes to
>> upstream.  Could you find a way to do that?
>
> I'm afraid I can't run an upstream kernel in our production environment :(
> Lots of code changes have to be made.

Understood.  Can you find a way to test the upstream behavior, if not the
upstream kernel exactly?  Or test the upstream kernel in an environment
that is similar to, but not exactly, your production environment?

>> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
>
> That seems incorrect.
> Look at the code in free_unref_page_commit():
>
>     if (pcp->count >= high) {
>         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>                            pcp, pindex);
>     }
>
> And nr_pcp_free():
>
>     min_nr_free = batch;
>     max_nr_free = high - batch;
>
>     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
>     return batch;
>
> The 'batch' is not a fixed value but changes dynamically, doesn't it?

Sorry, my words were confusing.  By 'batch' I mean the value of the
"count" parameter of free_pcppages_bulk().  For example, if we change
CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.

>> Your configuration will influence the PCP batch via configuring the PCP
>> high.  So, it may be reasonable to find a way to adjust the PCP batch
>> directly.  But, we need practical requirements and test methods first.

--
Best Regards,
Huang, Ying