static inline void vma_start_write(struct vm_area_struct *vma)
{
        int mm_lock_seq;

        if (__is_vma_write_locked(vma, &mm_lock_seq))
                return;

        down_write(&vma->vm_lock->lock);
        /*
         * We should use WRITE_ONCE() here because we can have concurrent reads
         * from the early lockless pessimistic check in vma_start_read().
         * We don't really care about the correctness of that early check, but
         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
         */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+       smp_wmb();
        up_write(&vma->vm_lock->lock);
}

Note: up_write(&vma->vm_lock->lock) in vma_start_write() is not enough
because it is one-way permeable (it's a "RELEASE operation"), so a later
vma->vm_file store (or any other VMA modification) can move before our
vma->vm_lock_seq store. This makes vma_start_write() heavier, but again,
it's write-locking, so it should not be considered a fast path.

With this change we can use the code suggested by Andrii in
https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@xxxxxxxxxxxxxx/
with an additional smp_rmb():

rcu_read_lock()
vma = find_vma(...)
if (!vma) /* bail */

vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
/* I think vma->vm_lock_seq has to be loaded first to avoid the race */
if (mm_lock_seq == vm_lock_seq)
        /* bail, vma is write-locked */

... perform uprobe lookup logic based on vma->vm_file->f_inode ...

smp_rmb();
if (vma->vm_lock_seq != vm_lock_seq)
        /* bail, VMA might have changed */

The smp_rmb() is needed so that the final vma->vm_lock_seq load does not
get reordered and moved up before the speculative section. (I'll append a
small userspace model of this write/read pairing at the end of this mail.)

I'm CC'ing Jann since he understands memory barriers way better than me
and will keep me honest.

>
> per-vma locking is still *locking*. Which means memory sharing between
> multiple CPUs, which means limited scalability. Lots of work in this
> series went to avoid even refcounting (as I pointed out for
> find_uprobe_rcu()) due to the same reason, and so relying on per-VMA
> locking is just shifting the bottleneck from mmap_lock to
> vma->vm_lock. The worst (and not uncommon) case is the same uprobe in
> the same process (and thus vma) being hit on multiple CPUs at the same
> time. Whether that's protected by mmap_lock or vma->vm_lock is
> immaterial at that point (from a scalability standpoint).
>
> >
> > I presume a down_read on vma around all the work would also sort out
> > any issues concerning stability of the file or inode objects.
> >
> > Of course single-threaded performance would take a hit due to the
> > atomics stemming from down/up_read, and parallel uprobe lookups on the
> > same vma would also get slower, but I don't know if that's a problem
> > for a real workload.
> >
> > I would not have any comments if all speedups were achieved without
> > modifying non-uprobe code.
>
> I'm also not a mm-focused person, so I'll let Suren and others address
> mm-specific concerns, but I (hopefully) addressed all the
> uprobe-related questions and concerns you had.
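As promised above, here is a minimal userspace sketch of the write/read
pairing, using C11 atomics: a release fence stands in for smp_wmb(), an
acquire fence for smp_rmb() (both conservative approximations), relaxed
atomics for READ_ONCE()/WRITE_ONCE(), and plain mutexes for the rwsems.
All names (mm_lock_seq, vm_lock_seq, payload, writer, reader) are made up
for illustration; this models the ordering argument only, not the kernel
implementation.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Stand-ins for the kernel state: mm_lock_seq and vm_lock_seq model
 * mm->mm_lock_seq and vma->vm_lock_seq; payload models vma->vm_file.
 * vm_lock_seq == mm_lock_seq means "vma is write-locked".
 */
static atomic_int mm_lock_seq = 1;
static atomic_int vm_lock_seq = 0;
static atomic_int payload = 0;
static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop = 0;

static void *writer(void *arg)
{
        (void)arg;
        for (int i = 1; i <= 1000000; i++) {
                pthread_mutex_lock(&mmap_lock);        /* mmap_write_lock() */

                /* vma_start_write(): mark the vma write-locked. */
                pthread_mutex_lock(&vm_lock);          /* down_write() */
                atomic_store_explicit(&vm_lock_seq,
                        atomic_load_explicit(&mm_lock_seq, memory_order_relaxed),
                        memory_order_relaxed);         /* WRITE_ONCE() */
                /* Models the added smp_wmb(): without it the payload store
                 * below could move above the seq store, since the unlock
                 * (a RELEASE) does not order *later* stores. */
                atomic_thread_fence(memory_order_release);
                pthread_mutex_unlock(&vm_lock);        /* up_write() */

                /* The actual modification ("vma->vm_file = ..."). */
                atomic_store_explicit(&payload, i, memory_order_relaxed);

                /* mmap_write_unlock(): bumping mm_lock_seq makes the two
                 * counters unequal again, i.e. write-unlocks the vma. */
                atomic_fetch_add_explicit(&mm_lock_seq, 1, memory_order_release);
                pthread_mutex_unlock(&mmap_lock);
        }
        atomic_store(&stop, 1);
        return NULL;
}

static void *reader(void *arg)
{
        long ok = 0, bailed = 0;

        (void)arg;
        while (!atomic_load(&stop)) {
                /* vm_lock_seq is loaded first, as in the pseudocode. */
                int vm_seq = atomic_load_explicit(&vm_lock_seq, memory_order_acquire);
                int mm_seq = atomic_load_explicit(&mm_lock_seq, memory_order_acquire);

                if (mm_seq == vm_seq) {         /* bail, vma is write-locked */
                        bailed++;
                        continue;
                }

                /* Speculative section: models the uprobe lookup. */
                int p = atomic_load_explicit(&payload, memory_order_relaxed);

                /* Models smp_rmb(): the re-check below must not be hoisted
                 * above the speculative load. */
                atomic_thread_fence(memory_order_acquire);
                if (atomic_load_explicit(&vm_lock_seq, memory_order_relaxed) != vm_seq) {
                        bailed++;               /* bail, VMA might have changed */
                        continue;
                }
                ok++;                           /* p came from a stable vma */
                (void)p;
        }
        printf("speculative reads ok=%ld bailed=%ld\n", ok, bailed);
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
}

Build with something like "cc -O2 -pthread seq_model.c". The claim this
models is deliberately narrow: any payload value the reader accepts was
fully published while the vma was write-unlocked from the reader's point
of view; drop either fence and, as I understand it, the re-check can no
longer make that promise.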