Hi, The following is a summarized repost of the range locking mmap_sem idea[1] and is _not_ intended for being considered upstream as there are quite a few issues that arise with this approach of tackling mmap_sem contention (keep reading). In fact this patch is quite incomplete and will break compiling on anything non-x86, and is also _completely broken_ for ksm and hmm. That being said, this does build an enterprise kernel and survives a number of workloads as well as 'runltp -f syscalls'. The previous series is a complete range locking conversion, which ensured we had all the range locking apis we needed. The changelog also included a number of performance numbers and overall design. While finding issues with the code itself is always welcome, the idea of this series is to discuss what can be done on top of it, if anything. >From a locking pov, most recently there has been a revival in the interest of the range lock code for dchinner's plans of range locking the i_rwsem. However, it showed that xfs's extent tree significantly outperformed[2] the (full) range lock. The performance differences when doing 1:1 rwsem comparisons, have already been shown in [1]. Considering both the range lock and the extent tree lock the whole tree, most of this performance penalties are due to the fact that rbtrees' depth is a lot larger than btree's, so the latter avoids most of the pointer chasing which is a common performance issue. This was a trade-off for not having to allocate memory for the range nodes. However, on the _positive side_, and which is what we care most about for mmap_sem, when actually using the lock as intended, the range locking did show its purpose: IOPS read/write (buffered IO) fio processes rwsem rangelock 1 57k / 57k 64k / 64k 2 61k / 61k 111k / 111k 4 61k / 61k 228k / 228k 8 55k / 55k 195k / 195k 16 15k / 15k 40k / 40k So it would be nice to apply this concept to our address space and allow mmaps, munmaps and pagefaults to all work concurrently in non-overlapping scenarios -- which is what is provided by userspace mm related syscalls. However, when using the range lock without a full range, a number of issues around the vma immediately popup as a consequence of this *top-down* approach to solving scalability: Races within a vma: non-overlapping regions can still belong to the same vma, hence wrecking merges and splits. One popular idea is to have a vma->rwsem (taken, for example, after a find_vma()), however, this throws out the window any potential scalability gains for large vmas as we just end up just moving down the point of contention. The same problem occurs when refcouting the vma (such as with speculative pfs). There's also the fact that we can end up taking numerous vma locks as the vma list is later traversed once the first vma is found. Alternatively, we could just expand the passed range such that it covers the whole first and last vma(s) endpoints; of course we don't have that information aprori (protected by mmap_sem :), and enlarging the range _after_ acquiring the lock opens a can of worms because now we have to inform userspace and/or deadlock, among others. Similarly, there's the issue of keeping the vma tree correct during modifications as well as regular find_vma()s. Laurent has already pointed out that we have too many ways of getting a vma: the tree, the list and the vmacache, all currently protected by mmap_sem and breaks because of the above when not using full ranges. This also touches a bit in a more *bottom-up* approach to mmap_sem performance, which scales from within, instead of putting a big rangelock tree on top of the address space. Matthew has pointed out a the xarray as well as an rcu based maple tree[3] replacement of the rbtree, however we already have the vmacache so most of the benefits of a shallower data structure are unnecessary, in cache-hot situations, naturally. The vma-list is easily removable once we have O(1) next/prev pointers, which for rbtrees can be done via threading the data structure (at the cost of extra branch for every level down the tree when inserting). Maple trees already give us this. So all in all, if we were going to go down this path of a cache friendlier tree, we'd end up needing comparisons of the maple tree vs the current vmacache+rbtree combo. Regarding rcu-ifying the vma tree and replacing read locking (and therefore plays nicer with cachelines), I sounds nice, it does not seem practical considering that the page tables cannot be rcu-ified. I'm sure I'm missing a lot more, but I'm hoping to kickstart the conversation again. Patches 1-2: adds the range locking machinery. This is rebased on the rbtree optimizations for interval trees such that we can quickly detect overlapping ranges. Some bug fixes and more documentation as also been added, with an ordering example in the source code. Patch 3: adds new mm locking wrappers around mmap_sem. Patches 4: teaches page fault paths about mmrange (specifically adding the range in question to the struct vm_fault). In addition, most of these patches update mmap_sem callers. Patch 5: is mostly a collection of shameless hacks to avoid for now teaching callers about range locking and just enlarging the series needlessly. Patches 6-13: adds most of the trivial conversions; most of this is generated with a cocinelle script[4], it's rather lame but gets most of the job done. Fix ups are pretty straightforward, yet manual. Patch 14: finally do the actual conversion and replace mmap_sem with the full range mmap_lock. Applies on top of today's linux-next tree. [1] https://lkml.org/lkml/2018/2/4/235 [2] https://lore.kernel.org/linux-fsdevel/20190416122240.GN29573@xxxxxxxxxxxxxxxxxxx/ [3] https://lore.kernel.org/lkml/20190314195452.GN19508@xxxxxxxxxxxxxxxxxxxxxx/ [4] http://linux-scalability.org/range-mmap_lock/mmap_sem.cocci Thanks! Davidlohr Bueso (14): interval-tree: build unconditionally Introduce range reader/writer lock mm: introduce mm locking wrappers mm: teach pagefault paths about range locking mm: remove some BUG checks wrt mmap_sem mm: teach the mm about range locking fs: teach the mm about range locking arch/x86: teach the mm about range locking virt: teach the mm about range locking net: teach the mm about range locking ipc: teach the mm about range locking kernel: teach the mm about range locking drivers: teach the mm about range locking mm: convert mmap_sem to range mmap_lock arch/x86/entry/vdso/vma.c | 12 +- arch/x86/events/core.c | 2 +- arch/x86/kernel/tboot.c | 2 +- arch/x86/kernel/vm86_32.c | 5 +- arch/x86/kvm/paging_tmpl.h | 9 +- arch/x86/mm/debug_pagetables.c | 8 +- arch/x86/mm/fault.c | 37 +- arch/x86/mm/mpx.c | 15 +- arch/x86/um/vdso/vma.c | 5 +- drivers/android/binder_alloc.c | 7 +- drivers/firmware/efi/efi.c | 2 +- drivers/gpu/drm/Kconfig | 2 - drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 +- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 7 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 11 +- drivers/gpu/drm/amd/amdkfd/kfd_events.c | 5 +- drivers/gpu/drm/i915/Kconfig | 1 - drivers/gpu/drm/i915/i915_gem.c | 5 +- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 +- drivers/gpu/drm/nouveau/nouveau_svm.c | 23 +- drivers/gpu/drm/radeon/radeon_cs.c | 5 +- drivers/gpu/drm/radeon/radeon_gem.c | 8 +- drivers/gpu/drm/radeon/radeon_mn.c | 7 +- drivers/gpu/drm/ttm/ttm_bo_vm.c | 4 +- drivers/infiniband/core/umem.c | 7 +- drivers/infiniband/core/umem_odp.c | 14 +- drivers/infiniband/core/uverbs_main.c | 5 +- drivers/infiniband/hw/mlx4/mr.c | 5 +- drivers/infiniband/hw/qib/qib_user_pages.c | 7 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 5 +- drivers/iommu/Kconfig | 1 - drivers/iommu/amd_iommu_v2.c | 7 +- drivers/iommu/intel-svm.c | 7 +- drivers/media/v4l2-core/videobuf-core.c | 5 +- drivers/media/v4l2-core/videobuf-dma-contig.c | 5 +- drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +- drivers/misc/cxl/cxllib.c | 5 +- drivers/misc/cxl/fault.c | 5 +- drivers/misc/sgi-gru/grufault.c | 20 +- drivers/misc/sgi-gru/grufile.c | 5 +- drivers/misc/sgi-gru/grukservices.c | 4 +- drivers/misc/sgi-gru/grumain.c | 6 +- drivers/misc/sgi-gru/grutables.h | 5 +- drivers/oprofile/buffer_sync.c | 12 +- drivers/staging/kpc2000/kpc_dma/fileops.c | 5 +- drivers/tee/optee/call.c | 5 +- drivers/vfio/vfio_iommu_type1.c | 11 +- drivers/xen/gntdev.c | 5 +- drivers/xen/privcmd.c | 17 +- fs/aio.c | 5 +- fs/coredump.c | 5 +- fs/exec.c | 21 +- fs/io_uring.c | 5 +- fs/proc/base.c | 23 +- fs/proc/internal.h | 2 + fs/proc/task_mmu.c | 32 +- fs/proc/task_nommu.c | 22 +- fs/userfaultfd.c | 50 +- include/linux/hmm.h | 7 +- include/linux/huge_mm.h | 2 - include/linux/hugetlb.h | 9 +- include/linux/lockdep.h | 33 ++ include/linux/mm.h | 108 +++- include/linux/mm_types.h | 4 +- include/linux/pagemap.h | 6 +- include/linux/range_lock.h | 189 +++++++ include/linux/userfaultfd_k.h | 5 +- ipc/shm.c | 10 +- kernel/acct.c | 5 +- kernel/bpf/stackmap.c | 16 +- kernel/events/core.c | 5 +- kernel/events/uprobes.c | 27 +- kernel/exit.c | 9 +- kernel/fork.c | 18 +- kernel/futex.c | 7 +- kernel/locking/Makefile | 2 +- kernel/locking/range_lock.c | 667 +++++++++++++++++++++++ kernel/sched/fair.c | 5 +- kernel/sys.c | 22 +- kernel/trace/trace_output.c | 5 +- lib/Kconfig | 14 - lib/Kconfig.debug | 1 - lib/Makefile | 3 +- mm/filemap.c | 10 +- mm/frame_vector.c | 10 +- mm/gup.c | 86 +-- mm/hmm.c | 7 +- mm/hugetlb.c | 14 +- mm/init-mm.c | 2 +- mm/internal.h | 3 +- mm/khugepaged.c | 78 +-- mm/ksm.c | 45 +- mm/madvise.c | 36 +- mm/memcontrol.c | 10 +- mm/memory.c | 28 +- mm/mempolicy.c | 34 +- mm/migrate.c | 10 +- mm/mincore.c | 6 +- mm/mlock.c | 20 +- mm/mmap.c | 73 +-- mm/mmu_notifier.c | 9 +- mm/mprotect.c | 17 +- mm/mremap.c | 9 +- mm/msync.c | 9 +- mm/nommu.c | 25 +- mm/oom_kill.c | 5 +- mm/pagewalk.c | 3 - mm/process_vm_access.c | 8 +- mm/shmem.c | 2 +- mm/swapfile.c | 5 +- mm/userfaultfd.c | 21 +- mm/util.c | 10 +- net/ipv4/tcp.c | 5 +- net/xdp/xdp_umem.c | 5 +- security/tomoyo/domain.c | 2 +- virt/kvm/arm/mmu.c | 17 +- virt/kvm/async_pf.c | 7 +- virt/kvm/kvm_main.c | 18 +- 118 files changed, 1776 insertions(+), 594 deletions(-) create mode 100644 include/linux/range_lock.h create mode 100644 kernel/locking/range_lock.c -- 2.16.4