Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area

Uladzislau Rezki <urezki@xxxxxxxxx> · Thu, 29 Aug 2024 21:00:16 +0200

On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@xxxxxxxxxx>
> 
> When running the vmalloc stress on a 448-core system, observe the average
> latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
> 'funclatency.py' tool [1].
> 
>   # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
> 
>      usecs             : count    distribution
>         0 -> 1         : 0       |                                        |
>         2 -> 3         : 29      |                                        |
>         4 -> 7         : 19      |                                        |
>         8 -> 15        : 56      |                                        |
>        16 -> 31        : 483     |****                                    |
>        32 -> 63        : 1548    |************                            |
>        64 -> 127       : 2634    |*********************                   |
>       128 -> 255       : 2535    |*********************                   |
>       256 -> 511       : 1776    |**************                          |
>       512 -> 1023      : 1015    |********                                |
>      1024 -> 2047      : 573     |****                                    |
>      2048 -> 4095      : 488     |****                                    |
>      4096 -> 8191      : 1091    |*********                               |
>      8192 -> 16383     : 3078    |*************************               |
>     16384 -> 32767     : 4821    |****************************************|
>     32768 -> 65535     : 3318    |***************************             |
>     65536 -> 131071    : 1718    |**************                          |
>    131072 -> 262143    : 2220    |******************                      |
>    262144 -> 524287    : 1147    |*********                               |
>    524288 -> 1048575   : 1179    |*********                               |
>   1048576 -> 2097151   : 822     |******                                  |
>   2097152 -> 4194303   : 906     |*******                                 |
>   4194304 -> 8388607   : 2148    |*****************                       |
>   8388608 -> 16777215  : 4497    |*************************************   |
>  16777216 -> 33554431  : 289     |**                                      |
> 
>   avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
> 
>   The worst case is over 16-33 seconds, so soft lockup is triggered [2].
> 
> [Root Cause]
> 1) Each purge_list has the long list. The following shows the number of
>    vmap_area is purged.
> 
>    crash> p vmap_nodes
>    vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
>    crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
>      nr_purged = 663070
>      ...
>      nr_purged = 821670
>      nr_purged = 692214
>      nr_purged = 726808
>      ...
> 
> 2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
>    operation when purging each vmap_area. However, the iteration is over
>    600000 vmap_area (See 'nr_purged' above).
> 
>    Here is objdump output:
> 
>      $ objdump -D vmlinux
>      ffffffff813e8c80 <purge_vmap_node>:
>      ...
>      ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
>      ...
> 
>    Quote from "Instruction tables" pdf file [3]:
>      Instructions with a LOCK prefix have a long latency that depends on
>      cache organization and possibly RAM speed. If there are multiple
>      processors or cores or direct memory access (DMA) devices, then all
>      locked instructions will lock a cache line for exclusive access,
>      which may involve RAM access. A LOCK prefix typically costs more
>      than a hundred clock cycles, even on single-processor systems.
> 
>    That's why the latency of purge_vmap_node() dramatically increases
>    on a many-core system: One core is busy on purging each vmap_area of
>    the *long* purge_list and executing atomic_long_sub() for each
>    vmap_area, while other cores free vmalloc allocations and execute
>    atomic_long_add_return() in free_vmap_area_noflush().
> 
> [Solution]
> Employ a local variable to record the total purged pages, and execute
> atomic_long_sub() after the traversal of the purge_list is done. The
> experiment result shows the latency improvement is 99%.
> 
> [Experiment Result]
> 1) System Configuration: Three servers (with HT-enabled) are tested.
>      * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
>      * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
>      * 448-core server: AMD Zen 4 Processor*2
> 
> 2) Kernel Config
>      * CONFIG_KASAN is disabled
> 
> 3) The data in column "w/o patch" and "w/ patch"
>      * Unit: micro seconds (us)
>      * Each data is the average of 3-time measurements
> 
>          System        w/o patch (us)   w/ patch (us)    Improvement (%)
>      ---------------   --------------   -------------    -------------
>      72-core server          2194              14            99.36%
>      192-core server       143799            1139            99.21%
>      448-core server      1992122            6883            99.65%
> 
> [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12 
> [3] https://www.agner.org/optimize/instruction_tables.pdf
> 
> Signed-off-by: Adrian Huang <ahuang12@xxxxxxxxxx>
> ---
>  mm/vmalloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f9b6bd707d2..607697c81e60 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
>  {
>  	struct vmap_node *vn = container_of(work,
>  		struct vmap_node, purge_work);
> +	unsigned long nr_purged_pages = 0;
>  	struct vmap_area *va, *n_va;
>  	LIST_HEAD(local_list);
>  
> @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
>  
>  		list_del_init(&va->list);
>  
> -		atomic_long_sub(nr, &vmap_lazy_nr);
> +		nr_purged_pages += nr;
>  		vn->nr_purged++;
>  
>  		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
>  		list_add(&va->list, &local_list);
>  	}
>  
> +	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> +
>  	reclaim_list_global(&local_list);
>  }
>  
> -- 
> 2.34.1
> 
I see the point and it looks good to me.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>

Thank you for improving this. There is one more spot which i detected
earlier, it is:

<snip>
static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy_max = lazy_max_pages();
	unsigned long va_start = va->va_start;
	unsigned int vn_id = decode_vn_id(va->flags);
	struct vmap_node *vn;
	unsigned long nr_lazy;

	if (WARN_ON_ONCE(!list_empty(&va->list)))
		return;

	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
				PAGE_SHIFT, &vmap_lazy_nr);

...
<snip>

atomic_long_add_return() might also introduce a high contention. We can
optimize by splitting into more light atomics. Can you check it on your
448-cores system?

Tnanks!

--
Uladzislau Rezki