Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 15 Aug 2024 13:17:05 -0700

On Thu, Aug 15, 2024 at 11:58 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> +brauner for "struct file" lifetime
>
> On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > > > attempting uprobe look up speculatively.
>
> Stupid question: Is this uprobe stuff actually such a hot codepath
> that it makes sense to optimize it to be faster than the page fault
> path?

Not a stupid question, but yes, generally speaking uprobe performance
is critical for a bunch of tracing use cases. And having independent
threads implicitly contending with each other just because of uprobe's
internal implementation detail (while conceptually there should be no
dependencies for triggering uprobe from multiple parallel threads) is
a big surprise to users and affects production use cases beyond just
uprobe-handling BPF logic overhead ("useful overhead") they assume.

>
> (Sidenote: I find it kinda interesting that this is sort of going back
> in the direction of the old Speculative Page Faults design.)
>
> > > > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > > > validate that mm_struct stays intact for entire duration of this
> > > > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > > > >
> > > > > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > > > > cases, nicely improving uprobe/uretprobe scalability.
> > > > > > >
> > > > > >

[...]

> > Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
> > enough because it's one-way permeable (it's a "RELEASE operation") and
> > later vma->vm_file store (or any other VMA modification) can move
> > before our vma->vm_lock_seq store.
> >
> > This makes vma_start_write() heavier but again, it's write-locking, so
> > should not be considered a fast path.
> > With this change we can use the code suggested by Andrii in
> > https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@xxxxxxxxxxxxxx/
> > with an additional smp_rmb():
> >
> > rcu_read_lock()
> > vma = find_vma(...)
> > if (!vma) /* bail */
>
> And maybe add some comments like:
>
> /*
>  * Load the current VMA lock sequence - we will detect if anyone concurrently
>  * locks the VMA after this point.
>  * Pairs with smp_wmb() in vma_start_write().
>  */
> > vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
> /*
>  * Now we just have to detect if the VMA is already locked with its current
>  * sequence count.
>  *
>  * The following load is ordered against the vm_lock_seq load above (using
>  * smp_load_acquire() for the load above), and pairs with implicit memory
>  * ordering between the mm_lock_seq write in mmap_write_unlock() and the
>  * vm_lock_seq write in the next vma_start_write() after that (which can only
>  * occur after an mmap_write_lock()).
>  */
> > mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> > /* I think vm_lock has to be acquired first to avoid the race */
> > if (mm_lock_seq == vm_lock_seq)
> >         /* bail, vma is write-locked */
> > ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
> /*
>  * Order the speculative accesses above against the following vm_lock_seq
>  * recheck.
>  */
> > smp_rmb();
> > if (vma->vm_lock_seq != vm_lock_seq)
>

thanks, will incorporate these comments into the next revision

> (As I said on the other thread: Since this now relies on
> vma->vm_lock_seq not wrapping back to the same value for correctness,
> I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
> even better, an atomic64_t... though I realize we don't currently do
> that for seqlocks either.)
>
> >         /* bail, VMA might have changed */
> >
> > The smp_rmb() is needed so that vma->vm_lock_seq load does not get
> > reordered and moved up before speculation.
> >
> > I'm CC'ing Jann since he understands memory barriers way better than
> > me and will keep me honest.