Re: [PATCH 00/10] perf/uprobe: Optimize uprobes

On Sat, Aug 3, 2024 at 1:53 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Aug 02, 2024 at 10:47:15PM -0700, Andrii Nakryiko wrote:
>
> > Is there any reason why the approach below won't work?
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 8be9e34e786a..e21b68a39f13 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -2251,6 +2251,52 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> >         struct uprobe *uprobe = NULL;
> >         struct vm_area_struct *vma;
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +       vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
> > +       struct file *vm_file;
> > +       struct inode *vm_inode;
> > +       unsigned long vm_pgoff, vm_start, vm_end;
> > +       int vm_lock_seq;
> > +       loff_t offset;
> > +
> > +       rcu_read_lock();
> > +
> > +       vma = vma_lookup(mm, bp_vaddr);
> > +       if (!vma)
> > +               goto retry_with_lock;
> > +
> > +       vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
>
> So vma->vm_lock_seq is only updated on vma_start_write()

yep, I've looked a bit more at the implementation now

>
> > +
> > +       vm_file = READ_ONCE(vma->vm_file);
> > +       vm_flags = READ_ONCE(vma->vm_flags);
> > +       if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
> > +               goto retry_with_lock;
> > +
> > +       vm_inode = READ_ONCE(vm_file->f_inode);
> > +       vm_pgoff = READ_ONCE(vma->vm_pgoff);
> > +       vm_start = READ_ONCE(vma->vm_start);
> > +       vm_end = READ_ONCE(vma->vm_end);
>
> None of those are written with WRITE_ONCE(), so this buys you nothing.
> Compiler could be updating them one byte at a time while you load some
> franken-update.
>
> Also, if you're in the middle of split_vma() you might not get a
> consistent set.

I used READ_ONCE() only to prevent the compiler from re-reading those
values. We assume those values may be garbage anyway and double-check
everything, so the lack of WRITE_ONCE() doesn't matter. The same goes
for inconsistent values if we are in the middle of split_vma().

We use the result of all this speculative calculation only if we find
a valid uprobe (which could be a false positive) *and* if we detect
that nothing about the VMA changed (which is the part I got wrong, but
honestly I was betting on others to help me get this right anyway).

>
> > +       if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
> > +               goto retry_with_lock;
> > +
> > +       offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > +       uprobe = find_uprobe_rcu(vm_inode, offset);
> > +       if (!uprobe)
> > +               goto retry_with_lock;
> > +
> > +       /* now double check that nothing about VMA changed */
> > +       if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
> > +               goto retry_with_lock;
>
> Since vma->vm_lock_seq is only ever updated at vma_start_write(), you're
> checking you're in or after the same modification cycle.
>
> The point of sequence locks is to check whether you're *IN* a modification
> cycle and retry if you are. You're now explicitly continuing if you're in a
> modification.
>
> You really need:
>
>    seq++;
>    wmb();
>
>    ... do modification
>
>    wmb();
>    seq++;
>
> vs
>
>   do {
>           s = READ_ONCE(seq) & ~1;
>           rmb();
>
>           ... read stuff
>
>   } while (rmb(), seq != s);
>
>
> The thing to note is that seq will be odd while inside a modification
> and even outside; further, if the pre and post seq are both even but not
> identical, you've crossed a modification and also need to retry.
>

Ok, I don't think I got everything you have written above, sorry. But
let me explain what I think I need to do and please correct what I
(still) got wrong.

a) before starting speculation:
  a.1) read and remember current->mm->mm_lock_seq (using smp_load_acquire(), right?)
  a.2) read vma->vm_lock_seq (using smp_load_acquire(), I presume)
  a.3) if vm_lock_seq is odd, the VMA is already being modified, so bail out and retry with proper mmap_lock
b) proceed with the inode pointer fetch and offset calculation as I've coded it
c) look up the uprobe by inode+offset; if that fails -- bail out (if it succeeds, the result could still be wrong)
d) re-read vma->vm_lock_seq; if it changed, we started modifying/have already modified the VMA, so bail out
e) re-read mm->mm_lock_seq; if that changed -- presume the VMA got modified, bail out

At this point we should have a guarantee that nothing about the mm
changed and that the VMA didn't start being modified during our
speculative calculation + uprobe lookup. So if we found a valid uprobe,
it must be the correct one that we need.
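
Roughly, I imagine it would look something like the sketch below
(completely untested; the caller would fall back to the mmap_read_lock
path whenever this returns NULL, and I'm guessing at the odd/even
meaning of vm_lock_seq from your description, so a.3 in particular may
be wrong):

#ifdef CONFIG_PER_VMA_LOCK
static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
{
	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
	struct mm_struct *mm = current->mm;
	struct uprobe *uprobe = NULL;
	struct vm_area_struct *vma;
	struct file *vm_file;
	int mm_lock_seq, vm_lock_seq;
	loff_t offset;

	rcu_read_lock();

	/* (a.1) snapshot the mm-wide sequence before anything else */
	mm_lock_seq = smp_load_acquire(&mm->mm_lock_seq);

	vma = vma_lookup(mm, bp_vaddr);
	if (!vma)
		goto bail;

	/* (a.2 + a.3) bail if this VMA is being modified right now */
	vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
	if (vm_lock_seq & 1)
		goto bail;

	/* (b) speculative loads; values may be garbage, we recheck later */
	vm_file = READ_ONCE(vma->vm_file);
	if (!vm_file || (READ_ONCE(vma->vm_flags) & flags) != VM_MAYEXEC)
		goto bail;

	offset = (loff_t)(READ_ONCE(vma->vm_pgoff) << PAGE_SHIFT) +
		 (bp_vaddr - READ_ONCE(vma->vm_start));

	/* (c) lookup might be a false positive, that's fine for now */
	uprobe = find_uprobe_rcu(READ_ONCE(vm_file->f_inode), offset);
	if (!uprobe)
		goto bail;

	/* (d + e) validate that neither the VMA nor the mm changed on us */
	if (vm_lock_seq != smp_load_acquire(&vma->vm_lock_seq))
		goto bail;
	if (mm_lock_seq != smp_load_acquire(&mm->mm_lock_seq))
		goto bail;

	/* happy case, we speculated successfully */
	rcu_read_unlock();
	return uprobe;

bail:
	rcu_read_unlock();
	return NULL;
}
#endif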

Is that enough? Any holes in the approach? And thanks for thoroughly
thinking about this, btw!

P.S. This is basically the last big blocker on the way to linear
uprobe scalability with the number of active CPUs. I have
uretprobe+SRCU+timeout implemented and it seems to work fine; will
post soon-ish.

P.P.S. Also, funny enough, the below was another big scalability
limiter (and the last one) :) I'm not sure if we can just drop it or if
I should use a per-CPU counter (rough sketch after the diff below), but
with the change below plus the speculative VMA lookup (however buggy,
it works ok for benchmarking), I finally get linear scaling of uprobe
triggering throughput with the number of CPUs. We are very close.

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f7443e996b1b..64c2bc316a08 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1508,7 +1508,7 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
        int ret = 0;

        tu = container_of(con, struct trace_uprobe, consumer);
-       tu->nhit++;
+       //tu->nhit++;

        udd.tu = tu;
        udd.bp_addr = instruction_pointer(regs);
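
If the per-CPU counter route is the way to go, I'm imagining something
along these lines (hypothetical sketch only: it assumes the nhit field
in struct trace_uprobe is replaced with an alloc_percpu()'d
'unsigned long __percpu *nhits', which doesn't exist today):

	/* in uprobe_dispatcher(), instead of bouncing a shared cache line: */
	this_cpu_inc(*tu->nhits);

/* fold the per-CPU counts only when somebody actually reads them,
 * e.g. from probes_profile_seq_show() */
static unsigned long trace_uprobe_nhit(struct trace_uprobe *tu)
{
	unsigned long nhit = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		nhit += *per_cpu_ptr(tu->nhits, cpu);

	return nhit;
}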


> > +
> > +       /* happy case, we speculated successfully */
> > +       rcu_read_unlock();
> > +       return uprobe;
> > +
> > +retry_with_lock:
> > +       rcu_read_unlock();
> > +       uprobe = NULL;
> > +#endif
> > +
> >         mmap_read_lock(mm);
> >         vma = vma_lookup(mm, bp_vaddr);
> >         if (vma) {
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index cc760491f201..211a84ee92b4 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
> >                         NULL);
> >         files_cachep = kmem_cache_create("files_cache",
> >                         sizeof(struct files_struct), 0,
> > -                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > +                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
> >                         NULL);
> >         fs_cachep = kmem_cache_create("fs_cache",
> >                         sizeof(struct fs_struct), 0,




