The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. The first user that is covered in this patch set is the vm_area_struct rbtree in the mm_struct with the long term goal of reducing the contention of the mmap_sem. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly short than the rbtree so it has fewer cache misses. As you can see below, the performance is getting very close even without the vmacache. I am looking for ways to optimize the vma code to make this even faster and would like input. As this tree requires allocations to provide RCU functionality, it is worth avoiding expanding the tree only to contract it in the next step. This patch set is based on 5.10-rc1. Please note that 35821 lines of this patch is the test code. It is important to note that this is an RFC and there are limitations around what is currently supported. The current status of this release is: - **No support for 32 bit or nommu builds** - There is a performance regression with regards to kernel builds from -3% to -5% of system time. Elapsed time is within 1 second of 5.10-rc1. - Removal of the vmacache - The mm struct uses the maple tree to store and retrieve all VMAs - Decrease in performance in the following micro-benchmarks: - will-it-scale brk1 which tests insert speed of a vma as it is written of -25 to -33%. The test does not seem to test what it is meant to test. - kernbuild (~-4 to -5% system, less that -1% elapsed Amean) - Increase in performance in the following micro-benchmarks in Hmean: - will-it-scale malloc1-processes (~1-9%) - will-it-scale malloc1-threads (~29-71%) - will-it-scale pthread_mutex1-threads (~0-16%) - will-it-scale signal1-processes (2-17%) - will-it-scale brk1-threads **This test doesn't make sense, disregard** - Converted the following to use the advanced maple tree interface: - brk() - mmap_region() - do_munmap() - dup_mmap() - Currently operating in non-RCU mode. Once enabled, RCU mode will be enabled only when more than one thread is active on a task. At that point the maple tree will enable RCU lookup of VMAs. The long term goal of the maple tree is to reduce mmap_sem contention by removing users that cause contention one by one. This may lead to the lock being completely removed at some point. The secondary goals is to provide the kernel with a fast RCU tree with a simple interface and to clean up the mm code by removing the integration of the tree within the code and structures themselves. This patch set is based on 5.10-rc1. Link: https://github.com/oracle/linux-uek/releases/tag/howlett%2Fmaple%2F20201204 Implementation details: Each node is 256 Bytes and has multiple node types. The current implementation uses two node types and has a branching factor of 16 for leaf/non-alloc nodes and 10 for internal alloc nodes. The advantage of the internal alloc nodes is the ability to track the largest sub-tree gap for each entry. Even with the removal of the vmacache, the benchmarks are getting close to the rbtree, but I'd like some help in regards to making things faster. The tree is at a disadvantage of needing to allocate storage space for internal use where as the rbtree stored the data within the vm_area_struct. This is a necessary trade off to move to RCU mode in the future. Benchmark numbers from mmtests on a 144 core system: wis-pthreadmutex 5.10-rc1 maple tree Hmean pthread_mutex1-threads-2 19006635.84 ( 0.00%) 21036173.96 * 10.68%* Hmean pthread_mutex1-threads-5 10228467.75 ( 0.00%) 11890622.20 * 16.25%* Hmean pthread_mutex1-threads-8 13933097.01 ( 0.00%) 13856905.46 * -0.55%* Hmean pthread_mutex1-threads-12 11431792.97 ( 0.00%) 13082725.10 * 14.44%* Hmean pthread_mutex1-threads-21 14536655.83 ( 0.00%) 14318050.44 * -1.50%* Hmean pthread_mutex1-threads-30 14025017.40 ( 0.00%) 15451267.09 * 10.17%* Hmean pthread_mutex1-threads-48 15452188.47 ( 0.00%) 14355672.94 * -7.10%* Hmean pthread_mutex1-threads-79 13680017.69 ( 0.00%) 13664026.29 * -0.12%* Hmean pthread_mutex1-threads-110 14131595.92 ( 0.00%) 13659743.96 * -3.34%* Hmean pthread_mutex1-threads-141 13493999.70 ( 0.00%) 12951941.54 * -4.02%* wis-signal 5.10-rc1 maple tree Hmean signal1-processes-2 1804422.41 ( 0.00%) 1827311.56 * 1.27%* Hmean signal1-processes-5 4460172.84 ( 0.00%) 4476128.55 * 0.36%* Hmean signal1-processes-8 6900853.35 ( 0.00%) 6973210.80 * 1.05%* Hmean signal1-processes-12 9929132.51 ( 0.00%) 10055390.13 * 1.27%* Hmean signal1-processes-21 5864723.57 ( 0.00%) 6745451.58 * 15.02%* Hmean signal1-processes-30 6183688.25 ( 0.00%) 6405037.68 * 3.58%* Hmean signal1-processes-48 5439087.23 ( 0.00%) 5631558.18 * 3.54%* Hmean signal1-processes-79 6512401.34 ( 0.00%) 6706141.64 * 2.97%* Hmean signal1-processes-110 8056779.10 ( 0.00%) 8494438.86 * 5.43%* Hmean signal1-processes-141 9320516.16 ( 0.00%) 10571140.16 * 13.42%* wis-malloc 5.10-rc1 maple tree Hmean brk1-processes-2 4665394.67 ( 0.00%) 3531806.58 * -24.30%* Hmean brk1-processes-5 11772601.71 ( 0.00%) 8829092.61 * -25.00%* Hmean brk1-processes-8 18257028.74 ( 0.00%) 13720140.11 * -24.85%* Hmean brk1-processes-12 26611276.11 ( 0.00%) 20040283.01 * -24.69%* Hmean brk1-processes-21 43293789.47 ( 0.00%) 32553619.66 * -24.81%* Hmean brk1-processes-30 62819136.48 ( 0.00%) 47206392.92 * -24.85%* Hmean brk1-processes-48 99084516.67 ( 0.00%) 74442789.08 * -24.87%* Hmean brk1-processes-79 149256917.87 ( 0.00%) 110957538.42 * -25.66%* Hmean brk1-processes-110 167784473.57 ( 0.00%) 119664465.82 * -28.68%* Hmean brk1-processes-141 186425159.76 ( 0.00%) 128345206.83 * -31.15%* Hmean malloc1-processes-2 551028.28 ( 0.00%) 493969.15 * -10.36%* Hmean malloc1-processes-5 1351566.71 ( 0.00%) 1113801.87 * -17.59%* Hmean malloc1-processes-8 1277080.35 ( 0.00%) 1252745.65 * -1.91%* Hmean malloc1-processes-12 1318056.55 ( 0.00%) 1309786.94 * -0.63%* Hmean malloc1-processes-21 2154975.49 ( 0.00%) 2068093.55 * -4.03%* Hmean malloc1-processes-30 2741084.95 ( 0.00%) 2809368.92 * 2.49%* Hmean malloc1-processes-48 4143258.89 ( 0.00%) 4069176.07 * -1.79%* Hmean malloc1-processes-79 5138031.43 ( 0.00%) 5162196.46 * 0.47%* Hmean malloc1-processes-110 5022505.12 ( 0.00%) 5439233.18 * 8.30%* Hmean malloc1-processes-141 4956600.59 ( 0.00%) 5393879.66 * 8.82%* Hmean malloc1-threads-2 13228.04 ( 0.00%) 184582.23 *1295.39%* Hmean malloc1-threads-5 100002.88 ( 0.00%) 169561.76 * 69.56%* Hmean malloc1-threads-8 91218.91 ( 0.00%) 154958.11 * 69.87%* Hmean malloc1-threads-12 97814.95 ( 0.00%) 123943.65 * 26.71%* Hmean malloc1-threads-21 45995.46 ( 0.00%) 43358.07 * -5.73%* Hmean malloc1-threads-30 23986.53 ( 0.00%) 47083.82 * 96.29%* Hmean malloc1-threads-48 15167.31 ( 0.00%) 30307.31 * 99.82%* Hmean malloc1-threads-79 8976.45 ( 0.00%) 8374.31 * -6.71%* Hmean malloc1-threads-110 5355.38 ( 0.00%) 1970.16 * -63.21%* Hmean malloc1-threads-141 8452.32 ( 0.00%) 12303.51 * 45.56%* kernbench 5.10-rc1 maple tree Amean user-2 884.80 ( 0.00%) 884.09 * 0.08%* Amean syst-2 141.84 ( 0.00%) 147.07 * -3.68%* Amean elsp-2 518.70 ( 0.00%) 520.85 * -0.41%* Amean user-4 900.83 ( 0.00%) 899.11 * 0.19%* Amean syst-4 145.76 ( 0.00%) 151.17 * -3.71%* Amean elsp-4 268.92 ( 0.00%) 268.97 * -0.02%* Amean user-8 916.07 ( 0.00%) 916.70 * -0.07%* Amean syst-8 145.34 ( 0.00%) 151.73 * -4.40%* Amean elsp-8 140.55 ( 0.00%) 141.88 * -0.94%* Amean user-16 930.43 ( 0.00%) 931.27 * -0.09%* Amean syst-16 147.23 ( 0.00%) 153.34 * -4.15%* Amean elsp-16 76.12 ( 0.00%) 76.25 * -0.17%* Amean user-32 980.03 ( 0.00%) 979.52 * 0.05%* Amean syst-32 158.15 ( 0.00%) 164.38 * -3.94%* Amean elsp-32 46.13 ( 0.00%) 45.98 * 0.31%* Amean user-64 1087.31 ( 0.00%) 1092.55 * -0.48%* Amean syst-64 173.28 ( 0.00%) 181.08 * -4.50%* Amean elsp-64 29.51 ( 0.00%) 30.06 * -1.88%* Amean user-128 1610.27 ( 0.00%) 1612.96 * -0.17%* Amean syst-128 228.93 ( 0.00%) 240.80 * -5.19%* Amean elsp-128 25.25 ( 0.00%) 25.42 * -0.70%* Amean user-256 1742.43 ( 0.00%) 1738.54 * 0.22%* Amean syst-256 241.50 ( 0.00%) 256.13 * -6.06%* Amean elsp-256 25.08 ( 0.00%) 25.42 * -1.35%* Amean user-288 1742.33 ( 0.00%) 1739.64 * 0.15%* Amean syst-288 241.16 ( 0.00%) 256.44 * -6.34%* Amean elsp-288 25.10 ( 0.00%) 25.44 * -1.35%* Liam R. Howlett (28): radix tree test suite: Enhancements for Maple Tree radix tree test suite: Add support for fallthrough attribute radix tree test suite: Add support for kmem_cache_free_bulk radix tree test suite: Add keme_cache_alloc_bulk() support Maple Tree: Add new data structure mm: Start tracking VMAs with maple tree mm/mmap: Introduce unlock_range() for code cleanup mm/mmap: Change find_vma() to use the maple tree mm/mmap: Change find_vma_prev() to use maple tree mm/mmap: Change unmapped_area and unmapped_area_topdown to use maple tree kernel/fork: Convert dup_mmap to use maple tree mm: Remove rb tree. mm/gup: Expose mm_populate_vma() for use when the vma is known mm/mmap: Change do_brk_flags() to expand existing VMA and add do_brk_munmap() mm/mmap: Change vm_brk_flags() to use mm_populate_vma() mm: Move find_vma_intersection to mmap.c and change implementation to maple tree. mm/mmap: Change mmap_region to use maple tree state mm/mmap: Drop munmap_vma_range() mm: Remove vmacache mm/mmap: Change __do_munmap() to avoid unnecessary lookups. mm/mmap: Change __do_munmap() to use a ma_state mm/mmap: Move mmap_region() below do_munmap() mm/mmap: Add do_mas_munmap() and wraper for __do_munmap() mmap: Use find_vma_intersection in do_mmap() for overlap mmap: Remove __do_munmap() in favour of do_mas_munmap() mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap() mmap: Update count_vma_pages_range() to only use one ma_state mmap: make remove_vma_list() inline Documentation/core-api/index.rst | 1 + Documentation/core-api/maple-tree.rst | 36 + MAINTAINERS | 12 + arch/x86/kernel/tboot.c | 2 +- drivers/firmware/efi/efi.c | 2 +- fs/exec.c | 3 - fs/proc/task_mmu.c | 1 - include/linux/maple_tree.h | 439 + include/linux/mm.h | 20 +- include/linux/mm_types.h | 30 +- include/linux/mm_types_task.h | 5 - include/linux/sched.h | 1 - include/linux/sched/mm.h | 3 + include/linux/vmacache.h | 28 - include/trace/events/maple_tree.h | 227 + include/trace/events/mmap.h | 97 + init/main.c | 2 + kernel/debug/debug_core.c | 12 - kernel/fork.c | 35 +- lib/Makefile | 3 +- lib/maple_tree.c | 5694 +++ lib/test_maple_tree.c | 35821 ++++++++++++++++ mm/Makefile | 2 +- mm/debug.c | 4 +- mm/gup.c | 20 + mm/init-mm.c | 4 +- mm/mmap.c | 1904 +- mm/mremap.c | 7 +- mm/util.c | 8 + mm/vmacache.c | 117 - tools/testing/radix-tree/.gitignore | 2 + tools/testing/radix-tree/Makefile | 13 +- tools/testing/radix-tree/generated/autoconf.h | 1 + tools/testing/radix-tree/linux.c | 78 +- tools/testing/radix-tree/linux/kernel.h | 8 + tools/testing/radix-tree/linux/maple_tree.h | 3 + tools/testing/radix-tree/linux/slab.h | 2 + tools/testing/radix-tree/maple.c | 58 + tools/testing/radix-tree/test.h | 1 + .../radix-tree/trace/events/maple_tree.h | 8 + 40 files changed, 43554 insertions(+), 1160 deletions(-) create mode 100644 Documentation/core-api/maple-tree.rst create mode 100644 include/linux/maple_tree.h delete mode 100644 include/linux/vmacache.h create mode 100644 include/trace/events/maple_tree.h create mode 100644 lib/maple_tree.c create mode 100644 lib/test_maple_tree.c delete mode 100644 mm/vmacache.c create mode 100644 tools/testing/radix-tree/linux/maple_tree.h create mode 100644 tools/testing/radix-tree/maple.c create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h -- 2.28.0