Hello, folks.

1. This is a follow-up of the vmap topic that was highlighted at the
LSFMMBPF-2023 conference. This small series attempts to mitigate the
lock contention across the vmap/vmalloc code. The problem is described
here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

The material is tagged as a v2 version. It contains extra slides about
testing the throughput, the steps, and a comparison with the current
approach.

2. Motivation.

- The vmap code does not scale with the number of CPUs and this should
  be fixed;
- XFS folks have complained several times that vmalloc can be contended
  on their workloads:

<snip>
commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Jan 4 17:22:18 2022 -0800

    xfs: reduce kvmalloc overhead for CIL shadow buffers

    Oh, let me count the ways that the kvmalloc API sucks dog eggs.

    The problem is when we are logging lots of large objects, we hit
    kvmalloc really damn hard with costly order allocations, and
    behaviour utterly sucks:
...
<snip>

- If we align with the per-cpu allocator in terms of performance, we
  can remove it to simplify the vmap code. Currently we have 3
  allocators. See:

<snip>
/*** Per cpu kva allocator ***/

/*
 * vmap space is limited especially on 32 bit architectures. Ensure there is
 * room for at least 16 percpu vmap blocks per CPU.
 */

/*
 * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
 * to #define VMALLOC_SPACE (VMALLOC_END-VMALLOC_START). Guess
 * instead (we just need a rough idea)
 */
<snip>

3. Test

On my AMD Ryzen Threadripper 3970X 32-Core Processor, I have the
figures below:

        1-page            1-page-this-patch
 1      0.576131    vs    0.555889
 2      2.68376     vs    1.07895
 3      4.26502     vs    1.01739
 4      6.04306     vs    1.28924
 5      8.04786     vs    1.57616
 6      9.38844     vs    1.78142
 7      9.53481     vs    2.00172
 8      10.4609     vs    2.15964
 9      10.6089     vs    2.484
10      11.7372     vs    2.40443
11      11.5969     vs    2.71635
12      13.053      vs    2.6162
13      12.2973     vs    2.843
14      13.1429     vs    2.85714
15      13.7348     vs    2.90691
16      14.3687     vs    3.0285
17      14.8823     vs    3.05717
18      14.9571     vs    2.98018
19      14.9127     vs    3.0951
20      16.0786     vs    3.19521
21      15.8055     vs    3.24915
22      16.8087     vs    3.2521
23      16.7298     vs    3.25698
24      17.244      vs    3.36586
25      17.8416     vs    3.39384
26      18.8008     vs    3.40907
27      18.5591     vs    3.5254
28      19.761      vs    3.55468
29      20.06       vs    3.59869
30      20.4353     vs    3.6991
31      20.9082     vs    3.73028
32      21.0865     vs    3.82904

1..32 is the number of jobs. The results are in usec and represent the
vmalloc()/vfree() pair throughput.

The series is based on the v6.3 tag and is considered a beta version.
Please note that it does not support the vread() functionality yet, so
it is not fully complete.

Any input/thoughts are welcome.

Uladzislau Rezki (Sony) (9):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Add a per-CPU-zone infrastructure
  mm: vmalloc: Insert busy-VA per-cpu zone
  mm: vmalloc: Support multiple zones in vmallocinfo
  mm: vmalloc: Insert lazy-VA per-cpu zone
  mm: vmalloc: Offload free_vmap_area_lock global lock
  mm: vmalloc: Scale and activate cvz_size

 mm/vmalloc.c | 641 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 448 insertions(+), 193 deletions(-)

--
2.30.2
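
P.S. As a reference for what the figures in section 3 measure, below is
a minimal, illustrative sketch of a kthread-based stress module that
times vmalloc()/vfree() pairs per worker. This is only an assumption of
how such a measurement can be done, not the exact benchmark behind the
numbers above; the module and parameter names (vm_stress, nr_jobs) are
hypothetical, and the in-tree lib/test_vmalloc.c module provides a
similar, more complete facility.

<snip>
/*
 * Illustrative sketch only: each worker thread performs NR_PAIRS
 * vmalloc(PAGE_SIZE)/vfree() pairs and reports the elapsed time in usec.
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/vmalloc.h>
#include <linux/ktime.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/err.h>

static int nr_jobs = 1;
module_param(nr_jobs, int, 0444);
MODULE_PARM_DESC(nr_jobs, "Number of concurrent vmalloc/vfree workers");

#define NR_PAIRS 1000000

static struct task_struct **workers;

static int stress_fn(void *unused)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < NR_PAIRS; i++) {
		/* 1-page allocation, as in the "1-page" column above. */
		void *p = vmalloc(PAGE_SIZE);

		if (WARN_ON_ONCE(!p))
			break;

		vfree(p);
	}

	pr_info("%s: %d vmalloc/vfree pairs took %lld usec\n",
		current->comm, i, ktime_us_delta(ktime_get(), start));

	/* Park until the module is unloaded. */
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);

	return 0;
}

static int __init vm_stress_init(void)
{
	int i;

	workers = kcalloc(nr_jobs, sizeof(*workers), GFP_KERNEL);
	if (!workers)
		return -ENOMEM;

	for (i = 0; i < nr_jobs; i++)
		workers[i] = kthread_run(stress_fn, NULL, "vm_stress/%d", i);

	return 0;
}

static void __exit vm_stress_exit(void)
{
	int i;

	for (i = 0; i < nr_jobs; i++)
		if (!IS_ERR_OR_NULL(workers[i]))
			kthread_stop(workers[i]);

	kfree(workers);
}

module_init(vm_stress_init);
module_exit(vm_stress_exit);
MODULE_LICENSE("GPL");
<snip>

Loading it with, for example, "insmod vm_stress.ko nr_jobs=32" would run
32 concurrent workers, which roughly corresponds to one row of the table
above.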