On Mon, Sep 02, 2024 at 08:00:46PM +0800, Adrian Huang wrote:
> On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote:
> > atomic_long_add_return() might also introduce a high contention. We can
> > optimize by splitting into more light atomics. Can you check it on your
> > 448-core system?
>
> Interestingly, the following result shows that the average latency of
> free_vmap_area_noflush() is just 26 usecs (the worst case is 16ms-32ms):
>
> /home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
>      usecs               : count      distribution
>          0 -> 1          : 18166      |                                        |
>          2 -> 3          : 41929818   |**                                      |
>          4 -> 7          : 181203439  |***********                             |
>          8 -> 15         : 464242836  |*****************************           |
>         16 -> 31         : 620077545  |****************************************|
>         32 -> 63         : 442133041  |****************************            |
>         64 -> 127        : 111432597  |*******                                 |
>        128 -> 255        : 3441649    |                                        |
>        256 -> 511        : 302655     |                                        |
>        512 -> 1023       : 738        |                                        |
>       1024 -> 2047       : 73         |                                        |
>       2048 -> 4095       : 0          |                                        |
>       4096 -> 8191       : 0          |                                        |
>       8192 -> 16383      : 0          |                                        |
>      16384 -> 32767      : 196        |                                        |
>
> avg = 26 usecs, total: 49415657269 usecs, count: 1864782753
>
> free_vmap_area_noflush() executes the lock prefix only once, so its
> worst case might be just about a hundred clock cycles.
>
> The problem with purge_vmap_node() is that some cores are busy purging
> each vmap_area of the *long* purge_list and executing atomic_long_sub()
> for each vmap_area, while other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush(). The following crash
> log shows that 22 cores are busy purging vmap_area structs [1]:
>
> crash> bt -a | grep "purge_vmap_node+291" | wc -l
> 22
>
> So, the latency of purge_vmap_node() dramatically increases because it
> executes the lock prefix over 6,000,000 times. The issue is easier to
> reproduce if more cores execute purge_vmap_node() simultaneously.
>
Right. This is clear to me. Under heavy stressing in a tight loop we
invoke atomic_long_sub() once per freed VA. With 448 cores and one
stress job per CPU, we end up with a high-contention spot: every access
to the shared atomic requires a cache-line lock.

> Tested the following patch with the light atomics. However, nothing
> improved (though the worst case is improved):
>
>      usecs               : count      distribution
>          0 -> 1          : 7146       |                                        |
>          2 -> 3          : 31734187   |**                                      |
>          4 -> 7          : 161408609  |***********                             |
>          8 -> 15         : 461411377  |*********************************       |
>         16 -> 31         : 557005293  |****************************************|
>         32 -> 63         : 435518485  |*******************************         |
>         64 -> 127        : 175033097  |************                            |
>        128 -> 255        : 42265379   |***                                     |
>        256 -> 511        : 399112     |                                        |
>        512 -> 1023       : 734        |                                        |
>       1024 -> 2047       : 72         |                                        |
>
> avg = 32 usecs, total: 59952713176 usecs, count: 1864783491
>
Thank you for checking this! So there is no difference. As for the worst
case, it might be a measurement artifact: the measured time includes a
context switch, because a context which triggers free_vmap_area_noflush()
can easily be preempted.

--
Uladzislau Rezki
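For illustration, the "light atomics" split discussed above presumably
amounts to separating the combined add-and-return into two operations,
roughly as below. This is a hypothetical reconstruction of the idea, not
the actual tested patch; nr and nr_lazy stand for the number of lazily
freed pages and the running total read back in free_vmap_area_noflush():

	/* Before: one combined lock-prefixed RMW that also returns the new value. */
	nr_lazy = atomic_long_add_return(nr, &vmap_lazy_nr);

	/*
	 * After: the read side becomes a plain load, but the add is
	 * still a lock-prefixed RMW on the same contended cache line.
	 */
	atomic_long_add(nr, &vmap_lazy_nr);
	nr_lazy = atomic_long_read(&vmap_lazy_nr);

Since both variants keep one locked RMW per freed VA on the shared
vmap_lazy_nr cache line, the essentially unchanged average latency in the
second histogram is what one would expect.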
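By contrast, the per-VA subtraction in purge_vmap_node() can be batched
so that the shared cache line is touched once per purge batch rather than
once per vmap_area. A minimal, untested sketch, assuming the current
structure of purge_vmap_node() in mm/vmalloc.c and eliding everything
except the accounting (nr_purged_pages is a hypothetical local
accumulator):

static void purge_vmap_node(struct work_struct *work)
{
	struct vmap_node *vn = container_of(work,
		struct vmap_node, purge_work);
	unsigned long nr_purged_pages = 0;	/* local, uncontended */
	struct vmap_area *va, *n_va;

	vn->nr_purged = 0;

	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;

		/* ... detach and reclaim the VA as the current code does ... */

		/*
		 * Accumulate locally instead of executing a lock-prefixed
		 * atomic_long_sub() on every iteration.
		 */
		nr_purged_pages += nr;
		vn->nr_purged++;
	}

	/* One contended RMW per batch instead of one per freed VA. */
	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
}

This would replace the millions of locked RMWs on vmap_lazy_nr described
above with a single one per batch, taking purge_vmap_node() out of the
contention picture while leaving the lazy-pages accounting intact.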