Hello, folks.

1. This is a follow-up of the vmap topic that was highlighted at the
LSFMMBPF-2023 conference. This small series attempts to mitigate the
lock contention across the vmap/vmalloc code. The problem is described
here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

The material is tagged as a v2 version. It contains extra slides about
testing the throughput, the steps, and a comparison with the current
approach.

2. Motivation.

- The vmap code does not scale with the number of CPUs and this should
  be fixed;
- XFS folks have complained several times that vmalloc can be contended
  on their workloads:

<snip>
commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Jan 4 17:22:18 2022 -0800

    xfs: reduce kvmalloc overhead for CIL shadow buffers

    Oh, let me count the ways that the kvmalloc API sucks dog eggs.

    The problem is when we are logging lots of large objects, we hit
    kvmalloc really damn hard with costly order allocations, and
    behaviour utterly sucks:
...
<snip>

- If we align with the per-cpu allocator in terms of performance, we
  can remove it to simplify the vmap code. Currently we have 3
  allocators. See:

<snip>
/*** Per cpu kva allocator ***/

/*
 * vmap space is limited especially on 32 bit architectures. Ensure there is
 * room for at least 16 percpu vmap blocks per CPU.
 */

/*
 * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
 * to #define VMALLOC_SPACE (VMALLOC_END-VMALLOC_START). Guess
 * instead (we just need a rough idea)
 */
<snip>

3. Test

On my AMD Ryzen Threadripper 3970X 32-Core Processor, I have the
figures below:

        1-page            1-page-this-patch
 1      0.576131    vs    0.555889
 2      2.68376     vs    1.07895
 3      4.26502     vs    1.01739
 4      6.04306     vs    1.28924
 5      8.04786     vs    1.57616
 6      9.38844     vs    1.78142
 7      9.53481     vs    2.00172
 8      10.4609     vs    2.15964
 9      10.6089     vs    2.484
10      11.7372     vs    2.40443
11      11.5969     vs    2.71635
12      13.053      vs    2.6162
13      12.2973     vs    2.843
14      13.1429     vs    2.85714
15      13.7348     vs    2.90691
16      14.3687     vs    3.0285
17      14.8823     vs    3.05717
18      14.9571     vs    2.98018
19      14.9127     vs    3.0951
20      16.0786     vs    3.19521
21      15.8055     vs    3.24915
22      16.8087     vs    3.2521
23      16.7298     vs    3.25698
24      17.244      vs    3.36586
25      17.8416     vs    3.39384
26      18.8008     vs    3.40907
27      18.5591     vs    3.5254
28      19.761      vs    3.55468
29      20.06       vs    3.59869
30      20.4353     vs    3.6991
31      20.9082     vs    3.73028
32      21.0865     vs    3.82904

1..32 is the number of jobs. The results are in usec and represent the
vmalloc()/vfree() pair throughput.

The series is based on the v6.3 tag and is considered a beta version.
Please note that it does not support the vread() functionality yet, so
it is not fully complete.

Any input/thoughts are welcome.

Uladzislau Rezki (Sony) (9):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Add a per-CPU-zone infrastructure
  mm: vmalloc: Insert busy-VA per-cpu zone
  mm: vmalloc: Support multiple zones in vmallocinfo
  mm: vmalloc: Insert lazy-VA per-cpu zone
  mm: vmalloc: Offload free_vmap_area_lock global lock
  mm: vmalloc: Scale and activate cvz_size

 mm/vmalloc.c | 641 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 448 insertions(+), 193 deletions(-)

--
2.30.2
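
P.S. As a reference for what the figures in section 3 measure, below is
a minimal, illustrative sketch of a kthread-based stress module that
times vmalloc()/vfree() pairs per worker. This is only an assumption of
how such a measurement can be done, not the exact benchmark behind the
numbers above; the module and parameter names (vm_stress, nr_jobs) are
hypothetical, and the in-tree lib/test_vmalloc.c module provides a
similar, more complete facility.

<snip>
/*
 * Illustrative sketch only: each worker thread performs NR_PAIRS
 * vmalloc(PAGE_SIZE)/vfree() pairs and reports the elapsed time in usec.
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/vmalloc.h>
#include <linux/ktime.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/err.h>

static int nr_jobs = 1;
module_param(nr_jobs, int, 0444);
MODULE_PARM_DESC(nr_jobs, "Number of concurrent vmalloc/vfree workers");

#define NR_PAIRS 1000000

static struct task_struct **workers;

static int stress_fn(void *unused)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < NR_PAIRS; i++) {
		/* 1-page allocation, as in the "1-page" column above. */
		void *p = vmalloc(PAGE_SIZE);

		if (WARN_ON_ONCE(!p))
			break;

		vfree(p);
	}

	pr_info("%s: %d vmalloc/vfree pairs took %lld usec\n",
		current->comm, i, ktime_us_delta(ktime_get(), start));

	/* Park until the module is unloaded. */
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);

	return 0;
}

static int __init vm_stress_init(void)
{
	int i;

	workers = kcalloc(nr_jobs, sizeof(*workers), GFP_KERNEL);
	if (!workers)
		return -ENOMEM;

	for (i = 0; i < nr_jobs; i++)
		workers[i] = kthread_run(stress_fn, NULL, "vm_stress/%d", i);

	return 0;
}

static void __exit vm_stress_exit(void)
{
	int i;

	for (i = 0; i < nr_jobs; i++)
		if (!IS_ERR_OR_NULL(workers[i]))
			kthread_stop(workers[i]);

	kfree(workers);
}

module_init(vm_stress_init);
module_exit(vm_stress_exit);
MODULE_LICENSE("GPL");
<snip>

Loading it with, for example, "insmod vm_stress.ko nr_jobs=32" would run
32 concurrent workers, which roughly corresponds to one row of the table
above.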