[PATCH v4 00/66] Introducing the Maple Tree

The maple tree is an RCU-safe, range-based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
where a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_lock contention.
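
To make the idea of a non-overlapping range-based lookup concrete, here is a
minimal userspace sketch.  It is not the maple tree's node layout or API: the
struct, the slot count and the function name are all invented for this
example, and it only assumes that each slot is bounded by an inclusive upper
limit.

```c
#include <stddef.h>

#define TOY_SLOTS 4	/* hypothetical node size, chosen only for this example */

/* Toy node: slot[i] covers the indices above pivot[i - 1] up to pivot[i]. */
struct toy_range_node {
	unsigned long pivot[TOY_SLOTS];	/* inclusive upper bound of each range */
	void *slot[TOY_SLOTS];		/* entry stored for the range, NULL if empty */
	unsigned int used;		/* number of ranges in use */
};

/*
 * Return the entry whose range contains index, or NULL if the range is
 * empty or index lies beyond the last pivot.  The first range starts at 0.
 */
void *toy_range_lookup(const struct toy_range_node *node, unsigned long index)
{
	unsigned int i;

	for (i = 0; i < node->used; i++) {
		if (index <= node->pivot[i])
			return node->slot[i];
	}
	return NULL;
}
```

Because the ranges cannot overlap, a single comparison per slot is enough to
decide which entry, if any, owns an index.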

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.
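
As a rough illustration of the height argument, the sketch below computes the
minimum number of levels needed to hold n entries in a perfectly balanced
binary tree and in a tree with 16 entries per leaf and 10 children per
internal node.  It assumes every node is completely full, which real trees do
not achieve, and a real rbtree can be taller than the balanced case shown
here.  The function names are invented for this example.

```c
#include <stdint.h>

/*
 * Levels in a perfectly balanced binary tree holding n entries, one entry
 * per node: the smallest h such that 2^h - 1 >= n.
 */
unsigned int binary_tree_levels(uint64_t n)
{
	unsigned int h = 0;
	uint64_t capacity = 0;

	while (capacity < n) {
		capacity = capacity * 2 + 1;
		h++;
	}
	return h;
}

/*
 * Levels in a tree with leaf_slots entries per leaf and branch children
 * per internal node, assuming every node is full.
 */
unsigned int wide_tree_levels(uint64_t n, unsigned int leaf_slots,
			      unsigned int branch)
{
	unsigned int h = 1;
	uint64_t capacity = leaf_slots;

	if (n == 0)
		return 0;

	while (capacity < n) {
		capacity *= branch;
		h++;
	}
	return h;
}
```

For 1000 entries, binary_tree_levels(1000) returns 10 while
wide_tree_levels(1000, 16, 10) returns 3, so a lookup touches far fewer nodes
in the wider tree under these idealized assumptions.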

This patch set is based on v5.16-rc2.

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20211130

v4 changes:
 - Added the option of using an external lock to the maple tree.  This
   is especially useful for the VMA code as the mmap_lock can be used.
 - Added a vma iterator to abstract vma knowledge from subsystems
 - Rewrote mtree_walk() to be about 40% faster on simulated lookups in
   the test code
 - Reduce complexity in write side by reducing passing of flags in favor
   of smaller functions with minimal changes between the calls
 - Added struct ma_wr_state to keep track of a maple write state.  This
   greatly reduces the passing of arguments during a write operation
 - Added new mas_find_rev() interface to find a value at a specific
   index or lower
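
The "specific index or lower" semantics of the new reverse find can be
sketched in userspace as follows.  This is only an illustration over a sorted
array: the struct and function names are hypothetical, and the real
mas_find_rev() signature and maple state handling are not reproduced here.

```c
#include <stddef.h>

/* Toy stand-in for a stored range; not a maple tree type. */
struct toy_range {
	unsigned long first;	/* first index covered */
	unsigned long last;	/* last index covered, inclusive */
	void *entry;		/* value stored for the range */
};

/*
 * Return the entry whose range contains index or, failing that, the entry
 * of the closest range below index.  ranges must be sorted in ascending
 * order and must not overlap.  Returns NULL if nothing lies at or below
 * index.
 */
void *toy_find_rev(const struct toy_range *ranges, size_t count,
		   unsigned long index)
{
	size_t i;

	/* Walk from the highest range down to the lowest. */
	for (i = count; i-- > 0; ) {
		if (ranges[i].first <= index)
			return ranges[i].entry;
	}
	return NULL;
}
```

With ranges 0x1000-0x1fff and 0x4000-0x4fff stored, a search at 0x3000 falls
in the gap and returns the entry of the lower range.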


v3: https://lore.kernel.org/linux-mm/20211005012959.1110504-1-Liam.Howlett@xxxxxxxxxx/
v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@xxxxxxxxxx/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@xxxxxxxxxx/

Performance on a 144 core x86:

It is important to note that the code is still using the mmap_lock.  The
performance seems fairly similar on real-world workloads, while there
are variations in micro-benchmarks.

kernbench, increased system time, less user time:
                        5.15-rc2               5.15-rc2 + maple tree
Amean     user-2        883.21 (   0.00%)      880.33 *   0.33%*
Amean     syst-2        164.61 (   0.00%)      167.08 *  -1.50%*
Amean     elsp-2        529.54 (   0.00%)      529.70 *  -0.03%*
Amean     user-4        908.66 (   0.00%)      906.34 *   0.26%*
Amean     syst-4        172.86 (   0.00%)      175.70 *  -1.64%*
Amean     elsp-4        277.10 (   0.00%)      277.41 *  -0.11%*
Amean     user-8        960.68 (   0.00%)      959.70 *   0.10%*
Amean     syst-8        180.93 (   0.00%)      186.16 *  -2.89%*
Amean     elsp-8        151.82 (   0.00%)      151.25 *   0.38%*
Amean     user-16      1043.21 (   0.00%)     1043.24 *  -0.00%*
Amean     syst-16       191.97 (   0.00%)      197.97 *  -3.13%*
Amean     elsp-16        86.55 (   0.00%)       87.12 *  -0.65%*
Amean     user-32      1203.33 (   0.00%)     1201.70 *   0.14%*
Amean     syst-32       214.19 (   0.00%)      222.12 *  -3.70%*
Amean     elsp-32        54.20 (   0.00%)       55.17 *  -1.79%*
Amean     user-64      1232.98 (   0.00%)     1233.44 *  -0.04%*
Amean     syst-64       217.38 (   0.00%)      224.75 *  -3.39%*
Amean     elsp-64        32.96 (   0.00%)       33.17 *  -0.63%*
Amean     user-128     1608.15 (   0.00%)     1609.13 *  -0.06%*
Amean     syst-128      270.77 (   0.00%)      281.21 *  -3.85%*
Amean     elsp-128       25.92 (   0.00%)       26.10 *  -0.71%*
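
The percentage column appears to be the relative change against the baseline,
with a positive value meaning less time was spent.  That reading is inferred
from the numbers rather than stated anywhere in this letter, so treat the
helper below (an invented name) as a sketch for checking the table by hand.

```c
/*
 * Relative improvement of candidate over baseline, in percent.
 * Positive means the candidate took less time than the baseline.
 */
double gain_percent(double baseline, double candidate)
{
	if (baseline == 0.0)
		return 0.0;
	return (baseline - candidate) / baseline * 100.0;
}
```

For example, gain_percent(164.61, 167.08) is about -1.50, matching the syst-2
row, and gain_percent(883.21, 880.33) is about 0.33, matching the user-2 row.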


                                       5.15-rc2  5.15-rc2 + maple tree
Ops NUMA alloc hit                2198520208.00  2198508092.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local              2198582067.00  2198242905.00
Ops NUMA base-page range updates     3797662.00     3214462.00
Ops NUMA PTE updates                 1600670.00      966782.00
Ops NUMA PMD updates                    4291.00        4390.00
Ops NUMA hint faults                  381225.00      234029.00
Ops NUMA hint local faults %          112696.00      110041.00
Ops NUMA hint local percent               29.56          47.02
Ops NUMA pages migrated               308892.00      139306.00
Ops AutoNUMA cost                       1938.58        1195.29
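
The "hint local percent" row can be reproduced from the two rows above it,
which explains why the percentage rises even though the count of local faults
drops slightly: the total number of hint faults falls much further.  The
helper name below is invented for this check.

```c
/* Share of NUMA hint faults that were node-local, in percent. */
double local_percent(double local_faults, double hint_faults)
{
	if (hint_faults == 0.0)
		return 0.0;
	return local_faults / hint_faults * 100.0;
}
```

local_percent(112696, 381225) is about 29.56 and
local_percent(110041, 234029) is about 47.02, in line with the table.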

Increase in performance in the following micro-benchmarks in Hmean:
- wis malloc1-threads: Increase of 1041% to 14%

Decrease in performance in the following micro-benchmarks in Hmean:
- wis brk1-processes: Decrease of 38% to 43%

Mixed:
- wis malloc1-processes: +5% to -18%

ebizzy testing:

ebizzy shows a slowdown of 13% to 32%.  This benchmark seems to be
especially noisy and both the rbtree and the maple tree exhibit runs that are
abnormally high in both cache misses and instruction count.  Several
runs were used to remove the outliers.

When perf is used to analyze ebizzy, the rbtree spends less time in
native_queued_spin_lock_slowpath (58.95% rb vs 67.48% maple).  lockdep
showed that the mmap_lock is contended more for the maple tree.
Investigation into the contention showed that increasing the spin time
prior to the rw semaphore going to sleep helps the performance.  The
store operations are slower on the maple tree, so the write lock would
be held longer which could account for the spin time expiring prior to
the lock being released.  The plan of introducing lockless lookups will
remove all delays for the reader in this regard.

Patch organization:

Patches 1 to 4 are radix tree test suite additions for maple tree
support.

Patch 5 adds the maple tree, the bulk of which is test code.

Patches 6-16 are the removal of the rbtree from the mm_struct.  This now
includes the introduction of the vma iterator.
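
The value of an iterator here is that callers no longer need to know how VMAs
are linked together.  The userspace sketch below shows the shape of that
abstraction only.  Every name in it is hypothetical, a sorted array stands in
for the tree, and it is not the interface added by this series.

```c
#include <stddef.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_vma {
	unsigned long start;	/* first address covered */
	unsigned long end;	/* first address after the area */
};

struct toy_mm {
	const struct toy_vma *vmas;	/* sorted by address, non-overlapping */
	size_t count;
};

struct toy_vma_iter {
	const struct toy_mm *mm;
	size_t pos;	/* next area to hand out */
};

/* Position the iterator at the first area that ends above addr. */
void toy_iter_init(struct toy_vma_iter *it, const struct toy_mm *mm,
		   unsigned long addr)
{
	it->mm = mm;
	it->pos = 0;
	while (it->pos < mm->count && mm->vmas[it->pos].end <= addr)
		it->pos++;
}

/* Return the next area, or NULL when the walk is finished. */
const struct toy_vma *toy_iter_next(struct toy_vma_iter *it)
{
	if (it->pos >= it->mm->count)
		return NULL;
	return &it->mm->vmas[it->pos++];
}

/* A caller that sums the mapped bytes without touching the storage. */
unsigned long toy_total_mapped(const struct toy_mm *mm)
{
	struct toy_vma_iter it;
	const struct toy_vma *vma;
	unsigned long total = 0;

	toy_iter_init(&it, mm, 0);
	while ((vma = toy_iter_next(&it)) != NULL)
		total += vma->end - vma->start;
	return total;
}
```

Since toy_total_mapped() only uses the init and next calls, the backing store
could change from an array to a list or a tree without the caller changing.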

Patch 17 optimizes __vma_adjust() for the maple tree.

Patches 18-24 are the removal of the vmacache from the kernel.

Patches 25-65 are the removal of the linked list.

Patch 66 is a small cleanup from the removal of the vma linked list.

Liam R. Howlett (57):
  radix tree test suite: Add pr_err define
  radix tree test suite: Add kmem_cache_set_non_kernel()
  radix tree test suite: Add allocation counts and size to kmem_cache
  radix tree test suite: Add support for slab bulk APIs
  Maple Tree: Add new data structure
  mm: Start tracking VMAs with maple tree
  mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: Use maple tree for unmapped_area{_topdown}
  kernel/fork: Use maple tree for dup_mmap() during forking
  mm: Remove rb tree.
  mmap: Change zeroing of maple tree in __vma_adjust
  xen: Use vma_lookup() in privcmd_ioctl_mmap()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: Use maple tree operations for find_vma_intersection() and
    find_vma()
  mm/mmap: Use advanced maple tree API for mmap_region()
  mm: Remove vmacache
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Reorganize munmap to use maple states
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  arm64: Remove mmap linked list from vdso
  parisc: Remove mmap linked list from cache handling
  powerpc: Remove mmap linked list walks
  s390: Remove vma linked list walks
  x86: Remove vma linked list walks
  xtensa: Remove vma linked list walks
  cxl: Remove vma linked list walk
  optee: Remove vma linked list walk
  um: Remove vma linked list walk
  binfmt_elf: Remove vma linked list walk
  exec: Use VMA iterator instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  userfaultfd: Use maple tree iterator to iterate VMAs
  ipc/shm: Use VMA iterator instead of linked list
  acct: Use VMA iterator instead of linked list
  perf: Use VMA iterator
  sched: Use maple tree iterator to walk VMAs
  fork: Use VMA iterator
  bpf: Remove VMA linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/khugepaged: Use maple tree iterators instead of vma linked list
  mm/ksm: Use maple tree iterators instead of vma linked list
  mm/madvise: Use vma_find() instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use maple tree iterators instead of vma linked list
  mm/mlock: Use maple tree iterators instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_find() instead of vma linked list
  mm/msync: Use vma_find() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_find() instead of vma linked list
  mm/swapfile: Use maple tree iterator instead of vma linked list
  mm: Remove the vma linked list
  mm/mmap: Drop range_has_overlap() function

Matthew Wilcox (Oracle) (9):
  mm: Add VMA iterator
  mmap: Use the VMA iterator in count_vma_pages_range()
  damon: Convert __damon_va_three_regions to use the VMA iterator
  proc: Remove VMA rbtree use from nommu
  mm: Convert vma_lookup() to use the Maple Tree
  coredump: Remove vma linked list walk
  binfmt_elf: Take the mmap lock when walking the VMA list
  i915: Use the VMA iterator
  nommu: Remove uses of VMA linked list

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple-tree.rst         |   496 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/vdso.c                      |     3 +-
 arch/parisc/kernel/cache.c                    |     9 +-
 arch/powerpc/kernel/vdso.c                    |     6 +-
 arch/powerpc/mm/book3s32/tlb.c                |    11 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    13 +-
 arch/s390/configs/debug_defconfig             |     1 -
 arch/s390/kernel/vdso.c                       |     3 +-
 arch/s390/mm/gmap.c                           |     6 +-
 arch/um/kernel/tlb.c                          |    14 +-
 arch/x86/entry/vdso/vma.c                     |     9 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |     3 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |    14 +-
 drivers/misc/cxl/fault.c                      |    43 +-
 drivers/tee/optee/call.c                      |    18 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     6 +-
 fs/coredump.c                                 |    33 +-
 fs/exec.c                                     |    12 +-
 fs/proc/base.c                                |     5 +-
 fs/proc/internal.h                            |     2 +-
 fs/proc/task_mmu.c                            |    74 +-
 fs/proc/task_nommu.c                          |    45 +-
 fs/userfaultfd.c                              |    49 +-
 include/linux/maple_tree.h                    |   559 +
 include/linux/mm.h                            |    73 +-
 include/linux/mm_types.h                      |    43 +-
 include/linux/mm_types_task.h                 |     5 -
 include/linux/sched.h                         |     1 -
 include/linux/sched/mm.h                      |    13 +
 include/linux/userfaultfd_k.h                 |     7 +-
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/trace/events/maple_tree.h             |   123 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    21 +-
 kernel/acct.c                                 |    11 +-
 kernel/bpf/task_iter.c                        |    21 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     3 +-
 kernel/events/uprobes.c                       |     9 +-
 kernel/fork.c                                 |    69 +-
 kernel/sched/fair.c                           |    10 +-
 lib/Kconfig.debug                             |    15 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6771 +++
 lib/test_maple_tree.c                         | 37202 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/damon/vaddr.c                              |    53 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     9 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    81 +-
 mm/khugepaged.c                               |    11 +-
 mm/ksm.c                                      |    19 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    53 +-
 mm/mlock.c                                    |    19 +-
 mm/mmap.c                                     |  2030 +-
 mm/mprotect.c                                 |     7 +-
 mm/mremap.c                                   |    31 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   123 +-
 mm/oom_kill.c                                 |     3 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     5 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |    13 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     4 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     3 +
 87 files changed, 47098 insertions(+), 1794 deletions(-)
 create mode 100644 Documentation/core-api/maple-tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

-- 
2.30.2
