static inline void vma_start_write(struct vm_area_struct *vma)
{
        int mm_lock_seq;

        if (__is_vma_write_locked(vma, &mm_lock_seq))
                return;

        down_write(&vma->vm_lock->lock);
        /*
         * We should use WRITE_ONCE() here because we can have concurrent reads
         * from the early lockless pessimistic check in vma_start_read().
         * We don't really care about the correctness of that early check, but
         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
         */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+       smp_wmb();
        up_write(&vma->vm_lock->lock);
}

Note: up_write(&vma->vm_lock->lock) in vma_start_write() is not enough
because it is one-way permeable (it's a "RELEASE operation"), so a later
vma->vm_file store (or any other VMA modification) can move before our
vma->vm_lock_seq store. This makes vma_start_write() heavier, but again,
it's write-locking, so it should not be considered a fast path.

With this change we can use the code suggested by Andrii in
https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@xxxxxxxxxxxxxx/
with an additional smp_rmb():

rcu_read_lock()
vma = find_vma(...)
if (!vma) /* bail */

vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
/* I think vma->vm_lock_seq has to be loaded first to avoid the race */
if (mm_lock_seq == vm_lock_seq)
        /* bail, vma is write-locked */

... perform uprobe lookup logic based on vma->vm_file->f_inode ...

smp_rmb();
if (vma->vm_lock_seq != vm_lock_seq)
        /* bail, VMA might have changed */

The smp_rmb() is needed so that the final vma->vm_lock_seq load does not
get reordered and moved up before the speculative section. (I'll append a
small userspace model of this write/read pairing at the end of this mail.)

I'm CC'ing Jann since he understands memory barriers way better than me
and will keep me honest.

>
> per-vma locking is still *locking*. Which means memory sharing between
> multiple CPUs, which means limited scalability. Lots of work in this
> series went to avoid even refcounting (as I pointed out for
> find_uprobe_rcu()) due to the same reason, and so relying on per-VMA
> locking is just shifting the bottleneck from mmap_lock to
> vma->vm_lock. The worst (and not uncommon) case is the same uprobe in
> the same process (and thus vma) being hit on multiple CPUs at the same
> time. Whether that's protected by mmap_lock or vma->vm_lock is
> immaterial at that point (from a scalability standpoint).
>
> >
> > I presume a down_read on vma around all the work would also sort out
> > any issues concerning stability of the file or inode objects.
> >
> > Of course single-threaded performance would take a hit due to the
> > atomics stemming from down/up_read, and parallel uprobe lookups on the
> > same vma would also get slower, but I don't know if that's a problem
> > for a real workload.
> >
> > I would not have any comments if all speedups were achieved without
> > modifying non-uprobe code.
>
> I'm also not a mm-focused person, so I'll let Suren and others address
> mm-specific concerns, but I (hopefully) addressed all the
> uprobe-related questions and concerns you had.
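As promised above, here is a minimal userspace sketch of the write/read
pairing, using C11 atomics: a release fence stands in for smp_wmb(), an
acquire fence for smp_rmb() (both conservative approximations), relaxed
atomics for READ_ONCE()/WRITE_ONCE(), and plain mutexes for the rwsems.
All names (mm_lock_seq, vm_lock_seq, payload, writer, reader) are made up
for illustration; this models the ordering argument only, not the kernel
implementation.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Stand-ins for the kernel state: mm_lock_seq and vm_lock_seq model
 * mm->mm_lock_seq and vma->vm_lock_seq; payload models vma->vm_file.
 * vm_lock_seq == mm_lock_seq means "vma is write-locked".
 */
static atomic_int mm_lock_seq = 1;
static atomic_int vm_lock_seq = 0;
static atomic_int payload = 0;
static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop = 0;

static void *writer(void *arg)
{
        (void)arg;
        for (int i = 1; i <= 1000000; i++) {
                pthread_mutex_lock(&mmap_lock);        /* mmap_write_lock() */

                /* vma_start_write(): mark the vma write-locked. */
                pthread_mutex_lock(&vm_lock);          /* down_write() */
                atomic_store_explicit(&vm_lock_seq,
                        atomic_load_explicit(&mm_lock_seq, memory_order_relaxed),
                        memory_order_relaxed);         /* WRITE_ONCE() */
                /* Models the added smp_wmb(): without it the payload store
                 * below could move above the seq store, since the unlock
                 * (a RELEASE) does not order *later* stores. */
                atomic_thread_fence(memory_order_release);
                pthread_mutex_unlock(&vm_lock);        /* up_write() */

                /* The actual modification ("vma->vm_file = ..."). */
                atomic_store_explicit(&payload, i, memory_order_relaxed);

                /* mmap_write_unlock(): bumping mm_lock_seq makes the two
                 * counters unequal again, i.e. write-unlocks the vma. */
                atomic_fetch_add_explicit(&mm_lock_seq, 1, memory_order_release);
                pthread_mutex_unlock(&mmap_lock);
        }
        atomic_store(&stop, 1);
        return NULL;
}

static void *reader(void *arg)
{
        long ok = 0, bailed = 0;

        (void)arg;
        while (!atomic_load(&stop)) {
                /* vm_lock_seq is loaded first, as in the pseudocode. */
                int vm_seq = atomic_load_explicit(&vm_lock_seq, memory_order_acquire);
                int mm_seq = atomic_load_explicit(&mm_lock_seq, memory_order_acquire);

                if (mm_seq == vm_seq) {         /* bail, vma is write-locked */
                        bailed++;
                        continue;
                }

                /* Speculative section: models the uprobe lookup. */
                int p = atomic_load_explicit(&payload, memory_order_relaxed);

                /* Models smp_rmb(): the re-check below must not be hoisted
                 * above the speculative load. */
                atomic_thread_fence(memory_order_acquire);
                if (atomic_load_explicit(&vm_lock_seq, memory_order_relaxed) != vm_seq) {
                        bailed++;               /* bail, VMA might have changed */
                        continue;
                }
                ok++;                           /* p came from a stable vma */
                (void)p;
        }
        printf("speculative reads ok=%ld bailed=%ld\n", ok, bailed);
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
}

Build with something like "cc -O2 -pthread seq_model.c". The claim this
models is deliberately narrow: any payload value the reader accepts was
fully published while the vma was write-unlocked from the reader's point
of view; drop either fence and, as I understand it, the re-check can no
longer make that promise.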