Hi,

This is the first version of my work towards fine grained MM locking.
This is still early work - I am happy with my page fault changes, but
want to expand on the mmap/munmap side of things before I send the
next version. I have previously shared this with some of the copied
folks (for those who received that, there are no additional changes in
this public resend). Please expect a v2 within a few weeks, with
further changes for fine grained range locking in the mmap and munmap
paths.

This work originated in discussions at LSF/MM 2019; it is intended to
address the latency issues that are caused by false conflicts between
threads working on separate parts of their address space. The
priorities are to keep things as simple as possible, and to allow for
progressive conversion of the code base to finer grained MM locks.

The general approach is to replace the mmap_sem rwsem with a range
lock. Initially all lock/unlock sites are automatically converted to
lock the entire address space through a new API. Then, the API is
extended to support range locking. Locking sites can then be
progressively converted to use range locking, while leaving
unconverted sites working with no code changes.

When using a range lock (as opposed to a coarse lock), the following
rules apply:

- Some structures (notably the vma rbtree and associated statistics)
  are per-mm. They need to be locked separately using a new
  mm_vma_lock. The entire point of this patch set is to reduce false
  sharing latencies, so the mm_vma_lock must be held only for short
  times. We expect to do O(log N) operations holding the lock (for
  example, walking or updating the vma rbtree) but no O(N) operations
  (such as iterating on all vmas within a range or all mapped pages
  within a range).

- Code holding the mm_vma_lock should only update vma attributes for
  the range it has a write lock for.
  However, range locks only protect the vma's attributes, not the
  vmas themselves - vmas can still be split or merged with their
  neighbors if they have compatible attributes.

- Code holding a range lock but not the mm_vma_lock must be prepared
  for the vmas at both ends of the locked range to be merged with
  their neighbors outside of the locked range. The easiest way to do
  that is to copy the vma of record into a pseudo-vma before releasing
  the mm_vma_lock (this is a bit kludgy and I would prefer to copy
  only the necessary VMA attributes, but using a pseudo-vma makes it
  easier to maintain this patchset out of mainline for the moment).

Call sites that take a range lock usually take the mm_vma_lock
immediately afterwards - it would probably be more efficient to
collapse mm_vma_lock with the mutex that protects the range lock
structures. This isn't done yet as I tried to simplify the initial
implementation.

In the future I would also like to remove the various workarounds we
have been doing to limit mmap_sem hold times (e.g.
FAULT_FLAG_ALLOW_RETRY, vm_populate and munmap downgrading to a read
lock) which shouldn't be necessary if the locking was only effective
on the memory ranges affected by each operation.

The included changes apply on top of upstream kernel v5.5. Please
apply with git am -p0 - I'm not sure why my git format-patch setup
requires that.

Commits 1 to 6 implement a range locking API:
- 1 implements coarse locking as wrappers around rwsem;
- 2 converts most mmap_sem locking sites to use the new coarse locking
  API (using coccinelle to automate the conversion);
- 3 converts remaining mmap_sem locking sites which were missed by
  coccinelle;
- 4 extends the API to support range locking. The initial
  implementation still uses coarse locking (ignoring the range), but
  it validates that the callers use matching ranges in lock and unlock
  calls;
- 5 prepares callers to allow for sleeping during unlock;
- 6 actually implements the range locking functions.
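
To make the rules behind commits 1 to 6 concrete, here is a small userspace sketch of the bookkeeping involved: a coarse lock treated as a range covering the whole address space, the overlap test that decides whether two lockers conflict, and the matching-range check described for commit 4. Every name in it (struct sketch_range, the sketch_* helpers) is hypothetical and chosen for illustration; it is not the API from include/linux/mm_lock.h, and it models only the conflict rules, not the actual blocking or wakeup logic.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical range descriptor, NOT the patch set's actual structure. */
struct sketch_range {
	unsigned long start;	/* first byte covered */
	unsigned long end;	/* last byte covered (inclusive) */
	bool write;		/* true for a write lock, false for a read lock */
};

/* Build a range; start must not exceed end. */
static struct sketch_range sketch_make_range(unsigned long start,
					     unsigned long end, bool write)
{
	struct sketch_range r = { start, end, write };

	assert(start <= end);
	return r;
}

/* A coarse lock is modelled as a range spanning the whole address space. */
static struct sketch_range sketch_coarse_range(bool write)
{
	return sketch_make_range(0, ULONG_MAX, write);
}

/* Two inclusive ranges overlap if each one starts before the other ends. */
static bool sketch_overlap(const struct sketch_range *a,
			   const struct sketch_range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* Lockers conflict only if they overlap and at least one is a writer. */
static bool sketch_conflict(const struct sketch_range *a,
			    const struct sketch_range *b)
{
	return (a->write || b->write) && sketch_overlap(a, b);
}

/* Commit 4 style validation: unlock must name the range that was locked. */
static bool sketch_unlock_matches(const struct sketch_range *locked,
				  const struct sketch_range *unlocked)
{
	return locked->start == unlocked->start &&
	       locked->end == unlocked->end &&
	       locked->write == unlocked->write;
}
```

Under this model, two writers on disjoint ranges never conflict, while an unconverted call site (which still takes the coarse range) conflicts with every writer. That is the property that lets converted and unconverted sites coexist.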

Commits 7 to 12 allow the x86 fault handler to specify a range that
may be released while handling the fault:
- 7 adds a range field to struct vm_fault;
- 8 makes handle_mm_fault() populate that field;
- 9 and 10 honor it when dropping mmap_sem during fault handling;
- 11 is a cleanup to the x86 fault handler to prepare for 12;
- 12 changes the x86 fault handler to use an explicit lock range.

Commits 13 to 15 prepare for operating on a pseudo-vma during faults:
- 13 adds prepare_mm_fault(), which may update the vma of record
  (specifically, allocate an anon_vma) before creating the pseudo-vma;
- 14 disables swap vma readahead as its implementation keeps stats in
  the vma;
- 15 changes the x86 fault handler to use pseudo-vmas when handling
  anon vmas.

Commits 16 and 17 implement range locking in x86 anonymous vma faults:
- 16 adds the vma locking API to be used to manipulate vmas when
  holding a fine grained range lock;
- 17 converts the x86 fault handler to use a pmd sized range lock when
  operating on anon vmas.

Commits 18 to 20 extend the above to also work on filemap based files:
- 18 makes sure we release the correct range when dropping mmap_sem
  during filemap file access;
- 19 tags vm_operations that support range locking;
- 20 makes the x86 fault handler use fine grained ranges when faulting
  the supported files.

Commits 21 to 24 implement range locking for the most basic mmap()
case:
- 21 adds a locked argument to do_mmap();
- 22 makes do_mmap() acquire the mmap_sem if locked is false;
- 23 converts some easy call sites to pass locked=false;
- 24 changes do_mmap() to acquire a fine grained lock in the easiest
  case (anonymous mapping, known address, no prior existing mapping).
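
As an illustration of the "pmd sized range" mentioned for commit 17, the arithmetic can be sketched as below. This is only a sketch under stated assumptions: it assumes x86-64 with 4 KiB pages, where one pmd entry maps 2 MiB, and it does not clamp the range to the vma boundaries (whether the actual patch does so is not stated here). The SKETCH_* macros and sketch_* functions are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64 with 4 KiB pages, so one pmd entry maps 2 MiB. */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PMD_SIZE  (UINT64_C(1) << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK  (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical half-open range [start, end) to lock for one fault. */
struct sketch_fault_range {
	uint64_t start;
	uint64_t end;
};

/* Round the faulting address down to its pmd and cover that whole pmd. */
static struct sketch_fault_range sketch_pmd_range(uint64_t address)
{
	struct sketch_fault_range r;

	r.start = address & SKETCH_PMD_MASK;
	r.end = r.start + SKETCH_PMD_SIZE;
	/* Guard against wrap-around at the very top of the address space. */
	assert(r.end > r.start);
	return r;
}

/* Two faults contend on the same range lock only if they share a pmd. */
static int sketch_same_pmd(uint64_t a, uint64_t b)
{
	return (a & SKETCH_PMD_MASK) == (b & SKETCH_PMD_MASK);
}
```

With this granularity, faults from threads touching different 2 MiB regions would take disjoint ranges and so would not wait on each other, while faults within one pmd still serialize as they would under a coarse lock.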

Michel Lespinasse (24):
  MM locking API: initial implementation as rwsem wrappers
  MM locking API: use coccinelle to convert mmap_sem rwsem call sites
  MM locking API: manual conversion of mmap_sem call sites missed by
    coccinelle
  MM locking API: add range arguments
  MM locking API: allow for sleeping during unlock
  MM locking API: implement fine grained range locks
  mm/memory: add range field to struct vm_fault
  mm/memory: allow specifying MM lock range to handle_mm_fault()
  do_swap_page: use the vmf->range field when dropping mmap_sem
  handle_userfault: use the vmf->range field when dropping mmap_sem
  x86 fault handler: merge bad_area() functions
  x86 fault handler: use an explicit MM lock range
  mm/memory: add prepare_mm_fault() function
  mm/swap_state: disable swap vma readahead
  x86 fault handler: use a pseudo-vma when operating on anonymous vmas.
  MM locking API: add vma locking API
  x86 fault handler: implement range locking
  shared file mappings: use the vmf->range field when dropping mmap_sem
  mm: add field to annotate vm_operations that support range locking
  x86 fault handler: extend range locking to supported file vmas
  do_mmap: add locked argument
  do_mmap: implement locked argument
  do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring()
  do_mmap: implement easiest cases of fine grained locking

 arch/alpha/kernel/traps.c | 4 +-
 arch/alpha/mm/fault.c | 10 +-
 arch/arc/kernel/process.c | 4 +-
 arch/arc/kernel/troubleshoot.c | 4 +-
 arch/arc/mm/fault.c | 4 +-
 arch/arm/kernel/process.c | 4 +-
 arch/arm/kernel/swp_emulate.c | 4 +-
 arch/arm/lib/uaccess_with_memcpy.c | 16 +-
 arch/arm/mm/fault.c | 6 +-
 arch/arm64/kernel/traps.c | 4 +-
 arch/arm64/kernel/vdso.c | 8 +-
 arch/arm64/mm/fault.c | 8 +-
 arch/csky/kernel/vdso.c | 4 +-
 arch/csky/mm/fault.c | 8 +-
 arch/hexagon/kernel/vdso.c | 4 +-
 arch/hexagon/mm/vm_fault.c | 8 +-
 arch/ia64/kernel/perfmon.c | 8 +-
 arch/ia64/mm/fault.c | 8 +-
 arch/ia64/mm/init.c | 12 +-
 arch/m68k/kernel/sys_m68k.c | 14 +-
 arch/m68k/mm/fault.c | 8 +-
 arch/microblaze/mm/fault.c | 12 +-
 arch/mips/kernel/traps.c | 4 +-
 arch/mips/kernel/vdso.c | 4 +-
 arch/mips/mm/fault.c | 10 +-
 arch/nds32/kernel/vdso.c | 6 +-
 arch/nds32/mm/fault.c | 12 +-
 arch/nios2/mm/fault.c | 12 +-
 arch/nios2/mm/init.c | 4 +-
 arch/openrisc/mm/fault.c | 10 +-
 arch/parisc/kernel/traps.c | 6 +-
 arch/parisc/mm/fault.c | 8 +-
 arch/powerpc/kernel/vdso.c | 6 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +-
 arch/powerpc/kvm/book3s_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c | 12 +-
 arch/powerpc/kvm/e500_mmu_host.c | 4 +-
 arch/powerpc/mm/book3s64/iommu_api.c | 4 +-
 arch/powerpc/mm/book3s64/subpage_prot.c | 12 +-
 arch/powerpc/mm/copro_fault.c | 4 +-
 arch/powerpc/mm/fault.c | 12 +-
 arch/powerpc/oprofile/cell/spu_task_sync.c | 6 +-
 arch/powerpc/platforms/cell/spufs/file.c | 4 +-
 arch/riscv/kernel/vdso.c | 4 +-
 arch/riscv/mm/fault.c | 10 +-
 arch/s390/kernel/vdso.c | 4 +-
 arch/s390/kvm/gaccess.c | 4 +-
 arch/s390/kvm/kvm-s390.c | 24 +-
 arch/s390/kvm/priv.c | 32 +-
 arch/s390/mm/fault.c | 6 +-
 arch/s390/mm/gmap.c | 40 +-
 arch/s390/pci/pci_mmio.c | 4 +-
 arch/sh/kernel/sys_sh.c | 6 +-
 arch/sh/kernel/vsyscall/vsyscall.c | 4 +-
 arch/sh/mm/fault.c | 14 +-
 arch/sparc/mm/fault_32.c | 18 +-
 arch/sparc/mm/fault_64.c | 12 +-
 arch/sparc/vdso/vma.c | 4 +-
 arch/um/include/asm/mmu_context.h | 6 +-
 arch/um/kernel/tlb.c | 2 +-
 arch/um/kernel/trap.c | 6 +-
 arch/unicore32/mm/fault.c | 6 +-
 arch/x86/entry/vdso/vma.c | 10 +-
 arch/x86/kernel/tboot.c | 2 +-
 arch/x86/kernel/vm86_32.c | 4 +-
 arch/x86/kvm/mmu/paging_tmpl.h | 8 +-
 arch/x86/mm/debug_pagetables.c | 8 +-
 arch/x86/mm/fault.c | 110 ++-
 arch/x86/mm/mpx.c | 15 +-
 arch/x86/um/vdso/vma.c | 4 +-
 arch/xtensa/mm/fault.c | 10 +-
 drivers/android/binder_alloc.c | 10 +-
 drivers/firmware/efi/efi.c | 2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 10 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 4 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c | 4 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 8 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c | 20 +-
 drivers/gpu/drm/radeon/radeon_cs.c | 4 +-
 drivers/gpu/drm/radeon/radeon_gem.c | 6 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 4 +-
 drivers/infiniband/core/umem.c | 6 +-
 drivers/infiniband/core/umem_odp.c | 10 +-
 drivers/infiniband/core/uverbs_main.c | 4 +-
 drivers/infiniband/hw/mlx4/mr.c | 4 +-
 drivers/infiniband/hw/qib/qib_user_pages.c | 6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
 drivers/infiniband/sw/siw/siw_mem.c | 4 +-
 drivers/iommu/amd_iommu_v2.c | 4 +-
 drivers/iommu/intel-svm.c | 4 +-
 drivers/media/v4l2-core/videobuf-core.c | 4 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c | 4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c | 4 +-
 drivers/misc/cxl/cxllib.c | 4 +-
 drivers/misc/cxl/fault.c | 4 +-
 drivers/misc/sgi-gru/grufault.c | 16 +-
 drivers/misc/sgi-gru/grufile.c | 4 +-
 drivers/oprofile/buffer_sync.c | 10 +-
 drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +-
 drivers/tee/optee/call.c | 4 +-
 drivers/vfio/vfio_iommu_type1.c | 12 +-
 drivers/xen/gntdev.c | 4 +-
 drivers/xen/privcmd.c | 14 +-
 fs/aio.c | 16 +-
 fs/coredump.c | 4 +-
 fs/exec.c | 16 +-
 fs/ext4/file.c | 1 +
 fs/io_uring.c | 4 +-
 fs/proc/base.c | 18 +-
 fs/proc/task_mmu.c | 28 +-
 fs/proc/task_nommu.c | 18 +-
 fs/userfaultfd.c | 28 +-
 include/linux/hugetlb.h | 5 +-
 include/linux/mm.h | 56 +-
 include/linux/mm_lock.h | 285 ++++++++
 include/linux/mm_types.h | 22 +
 include/linux/mm_types_task.h | 21 +
 include/linux/mmu_notifier.h | 5 +-
 include/linux/pagemap.h | 7 +-
 include/linux/sched.h | 2 +
 init/init_task.c | 1 +
 ipc/shm.c | 11 +-
 kernel/acct.c | 4 +-
 kernel/bpf/stackmap.c | 32 +-
 kernel/events/core.c | 4 +-
 kernel/events/uprobes.c | 16 +-
 kernel/exit.c | 8 +-
 kernel/fork.c | 17 +-
 kernel/futex.c | 4 +-
 kernel/sched/fair.c | 4 +-
 kernel/sys.c | 18 +-
 kernel/trace/trace_output.c | 4 +-
 mm/Kconfig | 25 +
 mm/Makefile | 2 +
 mm/filemap.c | 10 +-
 mm/frame_vector.c | 4 +-
 mm/gup.c | 20 +-
 mm/hugetlb.c | 13 +-
 mm/init-mm.c | 3 +-
 mm/internal.h | 2 +-
 mm/khugepaged.c | 37 +-
 mm/ksm.c | 34 +-
 mm/madvise.c | 18 +-
 mm/memcontrol.c | 8 +-
 mm/memory.c | 55 +-
 mm/mempolicy.c | 22 +-
 mm/migrate.c | 8 +-
 mm/mincore.c | 4 +-
 mm/mlock.c | 16 +-
 mm/mm_lock_range.c | 691 ++++++++++++++++++
 mm/mm_lock_rwsem_checked.c | 134 ++++
 mm/mmap.c | 170 +++--
 mm/mmu_notifier.c | 4 +-
 mm/mprotect.c | 12 +-
 mm/mremap.c | 6 +-
 mm/msync.c | 8 +-
 mm/nommu.c | 36 +-
 mm/oom_kill.c | 4 +-
 mm/process_vm_access.c | 4 +-
 mm/shmem.c | 1 +
 mm/swap_state.c | 6 +
 mm/swapfile.c | 4 +-
 mm/userfaultfd.c | 14 +-
 mm/util.c | 14 +-
 net/ipv4/tcp.c | 4 +-
 net/xdp/xdp_umem.c | 4 +-
 virt/kvm/arm/mmu.c | 14 +-
 virt/kvm/async_pf.c | 4 +-
 virt/kvm/kvm_main.c | 8 +-
 170 files changed, 2183 insertions(+), 798 deletions(-)
 create mode 100644 include/linux/mm_lock.h
 create mode 100644 mm/mm_lock_range.c
 create mode 100644 mm/mm_lock_rwsem_checked.c

--
2.25.0.341.g760bfbb309-goog