On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote:
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-cores system?

Interestingly, the following result shows the latency of
free_vmap_area_noflush() is just 26 usecs on average (the worst case is
16ms-32ms):

  /home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

     usecs               : count      distribution
         0 -> 1          : 18166     |                                        |
         2 -> 3          : 41929818  |**                                      |
         4 -> 7          : 181203439 |***********                             |
         8 -> 15         : 464242836 |*****************************           |
        16 -> 31         : 620077545 |****************************************|
        32 -> 63         : 442133041 |****************************            |
        64 -> 127        : 111432597 |*******                                 |
       128 -> 255        : 3441649   |                                        |
       256 -> 511        : 302655    |                                        |
       512 -> 1023       : 738       |                                        |
      1024 -> 2047       : 73        |                                        |
      2048 -> 4095       : 0         |                                        |
      4096 -> 8191       : 0         |                                        |
      8192 -> 16383      : 0         |                                        |
     16384 -> 32767      : 196       |                                        |

  avg = 26 usecs, total: 49415657269 usecs, count: 1864782753

free_vmap_area_noflush() executes the lock prefix only once, so the worst
case might be just about a hundred clock cycles.

The problem with purge_vmap_node() is that some cores are busy purging
each vmap_area of the *long* purge_list and executing atomic_long_sub()
for each vmap_area, while other cores free vmalloc allocations and
execute atomic_long_add_return() in free_vmap_area_noflush(). The
following crash log shows that 22 cores are busy purging vmap_area
structs [1]:

  crash> bt -a | grep "purge_vmap_node+291" | wc -l
  22

So, the latency of purge_vmap_node() increases dramatically because it
executes the lock prefix over 600,000 times. The issue is easier to
reproduce if more cores execute purge_vmap_node() simultaneously.

[1] https://gist.github.com/AdrianHuang/c0030dd7755e673ed00cb197b76ce0a7
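FWIW, the purge-side contention could also be reduced by batching the
bookkeeping: accumulate the number of purged pages in a local variable
while walking purge_list, and fold the total into vmap_lazy_nr with a
single atomic_long_sub() per pass, so the lock prefix runs once per list
instead of once per vmap_area. A rough sketch of that idea (illustrative
only: the purge_vmap_node_batched() name is made up, the
vmap_node/purge_list layout is assumed from mm/vmalloc.c, and the per-va
teardown is elided):

static void purge_vmap_node_batched(struct vmap_node *vn)
{
	unsigned long nr_purged_pages = 0;
	struct vmap_area *va, *n_va;

	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
		/* Accumulate locally: no lock-prefixed op per vmap_area. */
		nr_purged_pages += (va->va_end - va->va_start) >> PAGE_SHIFT;

		list_del_init(&va->list);
		/* ... existing per-va merge/free work elided ... */
	}

	/* One atomic op for the whole purge_list. */
	if (nr_purged_pages)
		atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
}

The tradeoff is that vmap_lazy_nr stays slightly stale until a purge
pass finishes, but the per-element atomics go away entirely.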
As for the free side, I tested the following patch with the lighter
atomics. However, the average latency did not improve (only the worst
case did). Note that the patch is only meant to gauge the cost of the
lock prefix: the read/set pair is not an atomic RMW, so concurrent
updates to vmap_lazy_nr could be lost.

     usecs               : count      distribution
         0 -> 1          : 7146      |                                        |
         2 -> 3          : 31734187  |**                                      |
         4 -> 7          : 161408609 |***********                             |
         8 -> 15         : 461411377 |*********************************       |
        16 -> 31         : 557005293 |****************************************|
        32 -> 63         : 435518485 |*******************************         |
        64 -> 127        : 175033097 |************                            |
       128 -> 255        : 42265379  |***                                     |
       256 -> 511        : 399112    |                                        |
       512 -> 1023       : 734       |                                        |
      1024 -> 2047       : 72        |                                        |

  avg = 32 usecs, total: 59952713176 usecs, count: 1864783491

[free_vmap_area_noflush() assembly instructions wo/ the test patch]

ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6ed4: 4c 8b 65 08           mov    0x8(%rbp),%r12
ffffffff813e6ed8: 4c 2b 65 00           sub    0x0(%rbp),%r12
ffffffff813e6edc: 49 c1 ec 0c           shr    $0xc,%r12
ffffffff813e6ee0: 4c 89 e2              mov    %r12,%rdx
ffffffff813e6ee3: f0 48 0f c1 15 f4 2a  lock xadd %rdx,0x2bb2af4(%rip)  # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6eea: bb 02
ffffffff813e6eec: 8b 0d 0e b1 a5 01     mov    0x1a5b10e(%rip),%ecx     # ffffffff82e42000 <nr_vmap_nodes>
ffffffff813e6ef2: 49 01 d4              add    %rdx,%r12
ffffffff813e6ef5: 39 c8                 cmp    %ecx,%eax
...

[free_vmap_area_noflush() assembly instructions w/ the test patch: no
lock prefix]

ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6edb: 48 8b 5d 08           mov    0x8(%rbp),%rbx
ffffffff813e6edf: 48 2b 5d 00           sub    0x0(%rbp),%rbx
ffffffff813e6ee3: 8b 0d d7 b0 a5 01     mov    0x1a5b0d7(%rip),%ecx     # ffffffff82e41fc0 <nr_vmap_nodes>
ffffffff813e6ee9: 48 c1 eb 0c           shr    $0xc,%rbx
ffffffff813e6eed: 4c 8b 25 b4 f1 92 01  mov    0x192f1b4(%rip),%r12     # ffffffff82d160a8 <vmap_nodes>
ffffffff813e6ef4: 48 01 d3              add    %rdx,%rbx
ffffffff813e6ef7: 48 89 1d e2 2a bb 02  mov    %rbx,0x2bb2ae2(%rip)     # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6efe: 39 c8                 cmp    %ecx,%eax
...

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..3927c541440b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2357,8 +2357,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	if (WARN_ON_ONCE(!list_empty(&va->list)))
 		return;
 
-	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
-				PAGE_SHIFT, &vmap_lazy_nr);
+	nr_lazy = atomic_long_read(&vmap_lazy_nr);
+	nr_lazy += (va->va_end - va->va_start) >> PAGE_SHIFT;
+	atomic_long_set(&vmap_lazy_nr, nr_lazy);
 
 	/*
 	 * If it was request by a certain node we would like to