On Mon, Sep 02, 2024 at 08:00:46PM +0800, Adrian Huang wrote:
> On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote:
> > atomic_long_add_return() might also introduce a high contention. We can
> > optimize by splitting into more light atomics. Can you check it on your
> > 448-core system?
>
> Interestingly, the following result shows that the average latency of
> free_vmap_area_noflush() is just 26 usecs (the worst case is 16ms-32ms):
>
> /home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
>      usecs               : count      distribution
>          0 -> 1          : 18166      |                                        |
>          2 -> 3          : 41929818   |**                                      |
>          4 -> 7          : 181203439  |***********                             |
>          8 -> 15         : 464242836  |*****************************           |
>         16 -> 31         : 620077545  |****************************************|
>         32 -> 63         : 442133041  |****************************            |
>         64 -> 127        : 111432597  |*******                                 |
>        128 -> 255        : 3441649    |                                        |
>        256 -> 511        : 302655     |                                        |
>        512 -> 1023       : 738        |                                        |
>       1024 -> 2047       : 73         |                                        |
>       2048 -> 4095       : 0          |                                        |
>       4096 -> 8191       : 0          |                                        |
>       8192 -> 16383      : 0          |                                        |
>      16384 -> 32767      : 196        |                                        |
>
> avg = 26 usecs, total: 49415657269 usecs, count: 1864782753
>
> free_vmap_area_noflush() executes the lock prefix only once, so its
> worst case might be just about a hundred clock cycles.
>
> The problem with purge_vmap_node() is that some cores are busy purging
> each vmap_area of the *long* purge_list and executing atomic_long_sub()
> for each vmap_area, while other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush(). The following crash
> log shows that 22 cores are busy purging vmap_area structs [1]:
>
> crash> bt -a | grep "purge_vmap_node+291" | wc -l
> 22
>
> So, the latency of purge_vmap_node() dramatically increases because it
> executes the lock prefix over 6,000,000 times. The issue is easier to
> reproduce if more cores execute purge_vmap_node() simultaneously.
>
Right. This is clear to me. Under heavy stressing in a tight loop we
invoke atomic_long_sub() once per freed VA. With 448 cores and one
stress job per CPU, we end up with a high-contention spot: every access
to the shared atomic requires a cache-line lock.

> Tested the following patch with the light atomics. However, nothing
> improved (though the worst case is improved):
>
>      usecs               : count      distribution
>          0 -> 1          : 7146       |                                        |
>          2 -> 3          : 31734187   |**                                      |
>          4 -> 7          : 161408609  |***********                             |
>          8 -> 15         : 461411377  |*********************************       |
>         16 -> 31         : 557005293  |****************************************|
>         32 -> 63         : 435518485  |*******************************         |
>         64 -> 127        : 175033097  |************                            |
>        128 -> 255        : 42265379   |***                                     |
>        256 -> 511        : 399112     |                                        |
>        512 -> 1023       : 734        |                                        |
>       1024 -> 2047       : 72         |                                        |
>
> avg = 32 usecs, total: 59952713176 usecs, count: 1864783491
>
Thank you for checking this! So there is no difference. As for the worst
case, it might be a measurement artifact: the measured time includes a
context switch, because a context which triggers free_vmap_area_noflush()
can easily be preempted.

--
Uladzislau Rezki
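For illustration, the "light atomics" split discussed above presumably
amounts to separating the combined add-and-return into two operations,
roughly as below. This is a hypothetical reconstruction of the idea, not
the actual tested patch; nr and nr_lazy stand for the number of lazily
freed pages and the running total read back in free_vmap_area_noflush():

	/* Before: one combined lock-prefixed RMW that also returns the new value. */
	nr_lazy = atomic_long_add_return(nr, &vmap_lazy_nr);

	/*
	 * After: the read side becomes a plain load, but the add is
	 * still a lock-prefixed RMW on the same contended cache line.
	 */
	atomic_long_add(nr, &vmap_lazy_nr);
	nr_lazy = atomic_long_read(&vmap_lazy_nr);

Since both variants keep one locked RMW per freed VA on the shared
vmap_lazy_nr cache line, the essentially unchanged average latency in the
second histogram is what one would expect.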
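By contrast, the per-VA subtraction in purge_vmap_node() can be batched
so that the shared cache line is touched once per purge batch rather than
once per vmap_area. A minimal, untested sketch, assuming the current
structure of purge_vmap_node() in mm/vmalloc.c and eliding everything
except the accounting (nr_purged_pages is a hypothetical local
accumulator):

static void purge_vmap_node(struct work_struct *work)
{
	struct vmap_node *vn = container_of(work,
		struct vmap_node, purge_work);
	unsigned long nr_purged_pages = 0;	/* local, uncontended */
	struct vmap_area *va, *n_va;

	vn->nr_purged = 0;

	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;

		/* ... detach and reclaim the VA as the current code does ... */

		/*
		 * Accumulate locally instead of executing a lock-prefixed
		 * atomic_long_sub() on every iteration.
		 */
		nr_purged_pages += nr;
		vn->nr_purged++;
	}

	/* One contended RMW per batch instead of one per freed VA. */
	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
}

This would replace the millions of locked RMWs on vmap_lazy_nr described
above with a single one per batch, taking purge_vmap_node() out of the
contention picture while leaving the lazy-pages accounting intact.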