> Hello, folk!
>
> This is v3. It is based on the 6.7.0-rc8.
>
> 1. Motivation
>
> - Offload the global vmap locks, making the code scale with the number of CPUs;
> - If possible and there is an agreement, we can remove the "Per cpu kva allocator"
>   to make the vmap code simpler;
> - There were complaints from XFS folks that vmalloc can be contended on
>   their workloads.
>
> 2. Design (high-level overview)
>
> We introduce an effective vmap node logic. A node behaves as an independent
> entity and serves an allocation request directly (if possible) from its pool.
> That way it bypasses the global vmap space that is protected by its own lock.
>
> Access to the pools is serialized per CPU. The number of nodes equals the
> number of CPUs in a system. Please note the upper threshold is bounded by
> 128 nodes.
>
> Pools are size-segregated and populated based on system demand. The maximum
> alloc request that can be stored in the segregated storage is 256 pages. The
> lazy drain path first decays a pool by 25% and then repopulates it with
> freshly freed VAs for reuse, instead of returning them to the global space.
>
> When a VA is obtained (alloc path), it is stored in a separate node. The
> va->va_start address is converted to the node where it should be placed and
> reside. Doing so we balance VAs across the nodes, so access becomes scalable.
> The addr_to_node() function converts an address to the correct node.
>
> The vmap space is divided into segments of fixed size, 16 pages each. That
> way any address can be associated with a segment number. The number of
> segments equals num_possible_cpus(), but is not greater than 128. Numbering
> starts from 0. See below how an address is converted:
>
> static inline unsigned int
> addr_to_node_id(unsigned long addr)
> {
>         return (addr / zone_size) % nr_nodes;
> }
>
> On the free path, a VA is found by converting its "va_start" address to the
> node it resides in. It is moved from the "busy" data structure to the "lazy"
> one. Later on, as noted earlier, the lazy kworker decays each node pool and
> repopulates it with fresh incoming VAs. Please note, a VA is returned to the
> node that served the alloc request.
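>
> To make the routing idea more concrete, below is a minimal user-space sketch.
> It is not the kernel code: NR_NODES, ZONE_SIZE, struct demo_node and
> demo_free() are hypothetical names used only for illustration; the real
> per-node structures, locks and pools live in mm/vmalloc.c. The point it shows
> is that va_start alone identifies the owning node, so a free needs no global
> lookup:
>
> #include <stdio.h>
>
> #define PAGE_SIZE       4096UL
> #define ZONE_SIZE       (16UL * PAGE_SIZE)      /* 16-page segments, as described above */
> #define NR_NODES        4U                      /* pretend num_possible_cpus() == 4 */
>
> /* Hypothetical per-node state; a real node carries its own lock, trees and pools. */
> struct demo_node {
>         unsigned long nr_freed;
> };
>
> static struct demo_node nodes[NR_NODES];
>
> /* Same conversion as addr_to_node_id() quoted above. */
> static unsigned int addr_to_node_id(unsigned long addr)
> {
>         return (addr / ZONE_SIZE) % NR_NODES;
> }
>
> /* Free path: va_start alone picks the owning node. */
> static void demo_free(unsigned long va_start)
> {
>         struct demo_node *n = &nodes[addr_to_node_id(va_start)];
>
>         n->nr_freed++;  /* real code: take the node lock, move the VA from busy to lazy */
> }
>
> int main(void)
> {
>         unsigned long base = 0xffffc90000000000UL;      /* arbitrary 64-bit vmalloc-like address */
>         unsigned int i;
>
>         /* Addresses one segment apart are spread over consecutive nodes. */
>         for (i = 0; i < 8; i++)
>                 demo_free(base + i * ZONE_SIZE);
>
>         for (i = 0; i < NR_NODES; i++)
>                 printf("node %u: %lu freed VAs\n", i, nodes[i].nr_freed);
>
>         return 0;
> }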
>
> 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
>
> sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
>
> <default perf>
>    94.41%     0.89%  [kernel]        [k] _raw_spin_lock
>    93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
>    76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
>    72.96%     0.81%  [kernel]        [k] alloc_vmap_area
>    56.94%     0.00%  [kernel]        [k] __get_vm_area_node
>    41.95%     0.00%  [kernel]        [k] vmalloc
>    37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
>    35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
>    35.17%     0.00%  [kernel]        [k] ret_from_fork
>    35.17%     0.00%  [kernel]        [k] kthread
>    35.08%     0.00%  [test_vmalloc]  [k] test_func
>    34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
>    28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
>    23.53%     0.25%  [kernel]        [k] vfree.part.0
>    21.72%     0.00%  [kernel]        [k] remove_vm_area
>    20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
>     2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
> <default perf>
>    vs
> <patch-series perf>
>    82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
>    63.36%     0.02%  [kernel]        [k] vmalloc
>    63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
>    30.42%     4.46%  [kernel]        [k] vfree.part.0
>    28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
>    27.28%     0.19%  [kernel]        [k] __get_vm_area_node
>    26.13%     1.50%  [kernel]        [k] alloc_vmap_area
>    21.72%    21.67%  [kernel]        [k] clear_page_rep
>    19.51%     2.43%  [kernel]        [k] _raw_spin_lock
>    16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
>    13.40%     2.07%  [kernel]        [k] free_unref_page
>    10.62%     0.01%  [kernel]        [k] remove_vm_area
>     9.02%     8.73%  [kernel]        [k] insert_vmap_area
>     8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
>     8.94%     0.00%  [kernel]        [k] ret_from_fork
>     8.94%     0.00%  [kernel]        [k] kthread
>     8.29%     0.00%  [test_vmalloc]  [k] test_func
>     7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
>     5.30%     4.73%  [kernel]        [k] purge_vmap_node
>     4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
> <patch-series perf>
>
> This confirms that native_queued_spin_lock_slowpath goes down to 16.51%
> from 93.07%.
>
> The throughput is ~12x higher: the same test drops from ~10m51s on the
> default kernel (first run below) to ~51s with the series applied (second
> run below):
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real    10m51.271s
> user    0m0.013s
> sys     0m0.187s
> urezki@pc638:~$
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real    0m51.301s
> user    0m0.015s
> sys     0m0.040s
> urezki@pc638:~$
>
> 4. Changelog
>
> v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@xxxxxxxxx/
>
> Delta v2 -> v3:
>   - fix comments from the v2 feedback;
>   - switch from the pre-fetch chunk logic to less complex size-based pools.
>
> Baoquan He (1):
>   mm/vmalloc: remove vmap_area_list
>
> Uladzislau Rezki (Sony) (10):
>   mm: vmalloc: Add va_alloc() helper
>   mm: vmalloc: Rename adjust_va_to_fit_type() function
>   mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
>   mm: vmalloc: Remove global vmap_area_root rb-tree
>   mm: vmalloc: Remove global purge_vmap_area_root rb-tree
>   mm: vmalloc: Offload free_vmap_area_lock lock
>   mm: vmalloc: Support multiple nodes in vread_iter
>   mm: vmalloc: Support multiple nodes in vmallocinfo
>   mm: vmalloc: Set nr_nodes based on CPUs in a system
>   mm: vmalloc: Add a shrinker to drain vmap pools
>
>  .../admin-guide/kdump/vmcoreinfo.rst |    8 +-
>  arch/arm64/kernel/crash_core.c       |    1 -
>  arch/riscv/kernel/crash_core.c       |    1 -
>  include/linux/vmalloc.h              |    1 -
>  kernel/crash_core.c                  |    4 +-
>  kernel/kallsyms_selftest.c           |    1 -
>  mm/nommu.c                           |    2 -
>  mm/vmalloc.c                         | 1049 ++++++++++++-----
>  8 files changed, 786 insertions(+), 281 deletions(-)
>
> --
> 2.39.2
>

There is one thing that I have to clarify and which is still open for me.

Test machine:
  qemu x86_64 system, 64 CPUs, 64G of memory

Test suite:
  test_vmalloc.sh

Environment:
  mm-unstable, branch: next-20240220, where this series is located. On top
  of it I locally applied Suren Baghdasaryan's "Memory allocation profiling"
  v3 for a better understanding of memory usage.

Before running the test, the state is as below:

urezki@pc638:~$ sort -h /proc/allocinfo
   27.2MiB     6970 mm/memory.c:1122 module:memory func:folio_prealloc
   79.1MiB    20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    112MiB     8689 mm/slub.c:2202 module:slub func:alloc_slab_page
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172         936       63618           0         134       63236
Swap:              0           0           0
urezki@pc638:~$

The test suite stresses the vmap/vmalloc layer by creating workers which do
alloc/free in a tight loop, i.e. it is considered an extreme load.
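
For reference, a simplified, hypothetical sketch of the kind of worker body
such a test runs is shown here; the real tests live in lib/test_vmalloc.c and
exercise many more allocation patterns, and demo_stress_loop() plus its
iteration count are made up purely for illustration:

#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/errno.h>

/* Hypothetical worker body: allocate and free in a tight loop. */
static int demo_stress_loop(unsigned int iterations)
{
        unsigned int i;
        void *p;

        for (i = 0; i < iterations; i++) {
                p = vmalloc(PAGE_SIZE);         /* alloc path: alloc_vmap_area() */
                if (!p)
                        return -ENOMEM;

                vfree(p);                       /* free path: find_unlink_vmap_area() */
        }

        return 0;
}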

Below, three identical tests were done with only one difference, which is 64,
128 and 256 kworkers:

1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64

urezki@pc638:~$ sort -h /proc/allocinfo
   80.1MiB    20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    153MiB    39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    178MiB    13259 mm/slub.c:2202 module:slub func:alloc_slab_page
    350MiB    89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        1417       63054           0         298       62755
Swap:              0           0           0
urezki@pc638:~$

2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128

urezki@pc638:~$ sort -h /proc/allocinfo
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    154MiB    39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    196MiB    14038 mm/slub.c:2202 module:slub func:alloc_slab_page
   1.20GiB   315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        2556       61914           0         302       61616
Swap:              0           0           0
urezki@pc638:~$

3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256

urezki@pc638:~$ sort -h /proc/allocinfo
    127MiB    32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    197MiB    50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    278MiB    18519 mm/slub.c:2202 module:slub func:alloc_slab_page
   5.36GiB  1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        6741       57652           0         394       57431
Swap:              0           0           0
urezki@pc638:~$

pagetable_alloc - it increases as soon as higher pressure is applied by
raising the number of workers. Running the same number of jobs in a following
run does not increase it further; it stays at the same level as on the
previous run.

/**
 * pagetable_alloc - Allocate pagetables
 * @gfp: GFP flags
 * @order: desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
        struct page *page = alloc_pages(gfp | __GFP_COMP, order);

        return page_ptdesc(page);
}

Could you please comment on it? Or do you have any thoughts? Is it expected?
Are page tables ever shrunk?

/proc/slabinfo does not show any high "active" or "num" objects used by any
cache.

/proc/meminfo - "VmallocUsed" stays low after those three tests.

I have checked it with KASAN and KMEMLEAK and I do not see any issues.

Thank you for the help!

--
Uladzislau Rezki