Re: [PATCH 00/10] perf/uprobe: Optimize uprobes

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Wed, 7 Aug 2024 11:33:49 -0700

On Wed, Aug 7, 2024 at 11:04 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Wed, Aug 7, 2024 at 5:49 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Tue, Aug 6, 2024 at 10:13 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Aug 6, 2024 at 6:36 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Aug 5, 2024 at 9:08 PM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Sun, Aug 4, 2024 at 4:22 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Sat, Aug 3, 2024 at 1:53 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 02, 2024 at 10:47:15PM -0700, Andrii Nakryiko wrote:
> > > > > > >
> > > > > > > > Is there any reason why the approach below won't work?
> > > > > > >
> > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > index 8be9e34e786a..e21b68a39f13 100644
> > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > @@ -2251,6 +2251,52 @@ static struct uprobe
> > > > > > > > *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> > > > > > > >         struct uprobe *uprobe = NULL;
> > > > > > > >         struct vm_area_struct *vma;
> > > > > > > >
> > > > > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > > > > +       vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
> > > > > > > > +       struct file *vm_file;
> > > > > > > > +       struct inode *vm_inode;
> > > > > > > > +       unsigned long vm_pgoff, vm_start, vm_end;
> > > > > > > > +       int vm_lock_seq;
> > > > > > > > +       loff_t offset;
> > > > > > > > +
> > > > > > > > +       rcu_read_lock();
> > > > > > > > +
> > > > > > > > +       vma = vma_lookup(mm, bp_vaddr);
> > > > > > > > +       if (!vma)
> > > > > > > > +               goto retry_with_lock;
> > > > > > > > +
> > > > > > > > +       vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
> > > > > > >
> > > > > > > So vma->vm_lock_seq is only updated on vma_start_write()
> > > > > >
> > > > > > yep, I've looked a bit more at the implementation now
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +       vm_file = READ_ONCE(vma->vm_file);
> > > > > > > > +       vm_flags = READ_ONCE(vma->vm_flags);
> > > > > > > > +       if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
> > > > > > > > +               goto retry_with_lock;
> > > > > > > > +
> > > > > > > > +       vm_inode = READ_ONCE(vm_file->f_inode);
> > > > > > > > +       vm_pgoff = READ_ONCE(vma->vm_pgoff);
> > > > > > > > +       vm_start = READ_ONCE(vma->vm_start);
> > > > > > > > +       vm_end = READ_ONCE(vma->vm_end);
> > > > > > >
> > > > > > > None of those are written with WRITE_ONCE(), so this buys you nothing.
> > > > > > > Compiler could be updating them one byte at a time while you load some
> > > > > > > franken-update.
> > > > > > >
> > > > > > > Also, if you're in the middle of split_vma() you might not get a
> > > > > > > consistent set.
> > > > > >
> > > > > > I used READ_ONCE() only to prevent the compiler from re-reading those
> > > > > > values. We assume those values are garbage anyways and double-check
> > > > > > everything, so lack of WRITE_ONCE doesn't matter. Same for
> > > > > > inconsistency if we are in the middle of split_vma().
> > > > > >
> > > > > > We use the result of all this speculative calculation only if we find
> > > > > > a valid uprobe (which could be a false positive) *and* if we detect
> > > > > > that nothing about VMA changed (which is what I got wrong, but
> > > > > > honestly I was actually betting on others to help me get this right
> > > > > > anyways).
> > > > > >
> > > > > > >
> > > > > > > > +       if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
> > > > > > > > +               goto retry_with_lock;
> > > > > > > > +
> > > > > > > > +       offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > > > > > +       uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > > > > > +       if (!uprobe)
> > > > > > > > +               goto retry_with_lock;
> > > > > > > > +
> > > > > > > > +       /* now double check that nothing about VMA changed */
> > > > > > > > +       if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
> > > > > > > > +               goto retry_with_lock;
> > > > > > >
> > > > > > > Since vma->vma_lock_seq is only ever updated at vma_start_write() you're
> > > > > > > checking you're in or after the same modification cycle.
> > > > > > >
> > > > > > > The point of sequence locks is to check you *IN* a modification cycle
> > > > > > > and retry if you are. You're now explicitly continuing if you're in a
> > > > > > > modification.
> > > > > > >
> > > > > > > You really need:
> > > > > > >
> > > > > > >    seq++;
> > > > > > >    wmb();
> > > > > > >
> > > > > > >    ... do modification
> > > > > > >
> > > > > > >    wmb();
> > > > > > >    seq++;
> > > > > > >
> > > > > > > vs
> > > > > > >
> > > > > > >   do {
> > > > > > >           s = READ_ONCE(seq) & ~1;
> > > > > > >           rmb();
> > > > > > >
> > > > > > >           ... read stuff
> > > > > > >
> > > > > > >   } while (rmb(), seq != s);
> > > > > > >
> > > > > > >
> > > > > > > The thing to note is that seq will be odd while inside a modification
> > > > > > > and even outside, further if the pre and post seq are both even but not
> > > > > > > identical, you've crossed a modification and also need to retry.
> > > > > > >
> > > > > >
> > > > > > Ok, I don't think I got everything you have written above, sorry. But
> > > > > > let me explain what I think I need to do and please correct what I
> > > > > > (still) got wrong.
> > > > > >
> > > > > > a) before starting speculation,
> > > > > >   a.1) read and remember current->mm->mm_lock_seq (using
> > > > > > smp_load_acquire(), right?)
> > > > > >   a.2) read vma->vm_lock_seq (using smp_load_acquire() I presume)
> > > > > >   a.3) if vm_lock_seq is odd, we are already modifying VMA, so bail
> > > > > > out, try with proper mmap_lock
> > > > > > b) proceed with the inode pointer fetch and offset calculation as I've coded it
> > > > > > c) lookup uprobe by inode+offset, if failed -- bail out (if succeeded,
> > > > > > this could still be wrong)
> > > > > > d) re-read vma->vm_lock_seq, if it changed, we started modifying/have
> > > > > > already modified VMA, bail out
> > > > > > e) re-read mm->mm_lock_seq, if that changed -- presume VMA got
> > > > > > modified, bail out
> > > > > >
> > > > > > At this point we should have a guarantee that nothing about mm
> > > > > > changed, nor that VMA started being modified during our speculative
> > > > > > calculation+uprobe lookup. So if we found a valid uprobe, it must be a
> > > > > > correct one that we need.
> > > > > >
> > > > > > Is that enough? Any holes in the approach? And thanks for thoroughly
> > > > > > thinking about this, btw!
> > > > >
> > > > > Ok, with slight modifications to the details of the above (e.g., there
> > > > > is actually no "odd means VMA is being modified" thing with
> > > > > vm_lock_seq),
> > > >
> > > > Correct. Instead of that (vm_lock_seq->vm_lock_seq == mm->mm_lock_seq)
> > > > means your VMA is write-locked and is being modified.
> > > >
> > > > > I ended up with the implementation below. Basically we
> > > > > validate that mm->mm_lock_seq didn't change and that vm_lock_seq !=
> > > > > mm_lock_seq (which otherwise would mean "VMA is being modified").
> > > >
> > > > Validating that mm->mm_lock_seq did not change does not provide you
> > > > with useful information. It only means that between the point where
> > > > you recorded mm->mm_lock_seq and where you are checking it, there was
> > > > an mmap_write_unlock() or mmap_write_downgrade() call. Your VMA might
> > > > not have even been part of that modification for which mmap_lock was
> > > > taken.
> > > >
> > > > In theory what you need is simpler (simplified code for explanation only):
> > > >
> > > > int vm_lock_seq = vma->vm_lock_seq;
> > > > if (vm_lock_seq == mm->mm_lock_seq)
> > > >         goto bail_out; /* VMA is write-locked */
> > > >
> > > > /* copy required VMA attributes */
> > > >
> > > > if (vm_lock_seq != vma->vm_lock_seq)
> > > >         goto bail_out; /* VMA got write-locked */
> > > >
> > > > But this would require proper ACQUIRE/RELEASE semantics for
> > > > vma->vm_lock_seq which is currently not there because all reads/writes
> > > > to vma->vm_lock_seq that matter are done under vma->vm_lock->lock
> > > > protection, so additional ordering is not required. If you decide to
> > > > add that semantics for vma->vm_lock_seq, please make sure that
> > > > pagefault path performance does not regress.
> > > >
> > > > > There is a possibility that vm_lock_seq == mm_lock_seq just by
> > > > > accident, which is not a correctness problem, we'll just fallback to
> > > > > locked implementation until something about VMA or mm_struct itself
> > > > > changes. Which is fine, and if mm folks ever change this locking
> > > > > schema, this might go away.
> > > > >
> > > > > If this seems on the right track, I think we can just move
> > > > > mm_start_vma_specuation()/mm_end_vma_speculation() into
> > > > > include/linux/mm.h.
> > > > >
> > > > > And after thinking a bit more about READ_ONCE() usage, I changed them
> > > > > to data_race() to not trigger KCSAN warnings. Initially I kept
> > > > > READ_ONCE() only around vma->vm_file access, but given we never change
> > > > > it until vma is freed and reused (which would be prevented by
> > > > > guard(rcu)), I dropped READ_ONCE() and only added data_race(). And
> > > > > even data_race() is probably not necessary.
> > > > >
> > > > > Anyways, please see the patch below. Would be nice if mm folks
> > > > > (Suren?) could confirm that this is not broken.
> > > > >
> > > > >
> > > > >
> > > > > Author: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > > > Date:   Fri Aug 2 22:16:40 2024 -0700
> > > > >
> > > > >     uprobes: add speculative lockless VMA to inode resolution
> > > > >
> > > > >     Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > > >
> > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > index 3de311c56d47..bee7a929ff02 100644
> > > > > --- a/kernel/events/uprobes.c
> > > > > +++ b/kernel/events/uprobes.c
> > > > > @@ -2244,6 +2244,70 @@ static int is_trap_at_addr(struct mm_struct
> > > > > *mm, unsigned long vaddr)
> > > > >         return is_trap_insn(&opcode);
> > > > >  }
> > > > >
> > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > +static inline void mm_start_vma_speculation(struct mm_struct *mm, int
> > > > > *mm_lock_seq)
> > > > > +{
> > > > > +       *mm_lock_seq = smp_load_acquire(&mm->mm_lock_seq);
> > > > > +}
> > > > > +
> > > > > +/* returns true if speculation was safe (no mm and vma modification
> > > > > happened) */
> > > > > +static inline bool mm_end_vma_speculation(struct vm_area_struct *vma,
> > > > > int mm_lock_seq)
> > > > > +{
> > > > > +       int mm_seq, vma_seq;
> > > > > +
> > > > > +       mm_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq);
> > > > > +       vma_seq = READ_ONCE(vma->vm_lock_seq);
> > > > > +
> > > > > +       return mm_seq == mm_lock_seq && vma_seq != mm_seq;
> > > >
> > > > After spending some time on this I think what you do here is
> > > > semantically correct but sub-optimal.
> > >
> >
> > Yes, requiring that mm_lock_seq doesn't change is too pessimistic, but
> > relative to the frequency of uprobe/uretprobe triggering (and how fast
> > the lookup is) this won't matter much. Absolute majority of uprobe
> > lookups will manage to succeed while none of mm's VMAs change at all.
> > So I felt like that's ok, at least for starters.
> >
> > My goal is to minimize intrusion into purely mm-related code, this
> > whole uprobe work is already pretty large and sprawling, I don't want
> > to go on another quest to change locking semantics for vma, if I don't
> > absolutely have to :) But see below for adjusted logic based on your
> > comments.
> >
> > > Actually, after staring at this code some more I think
> > > vma->vm_lock_seq not having proper ACQUIRE/RELEASE semantics would
> > > bite us here as well. The entire find_active_uprobe_speculative()
> > > might be executing while mmap_lock is write-locked (so, mm_seq ==
> > > mm_lock_seq is satisfied) and we might miss that the VMA is locked due
> > > to vma->vm_lock_seq read/write reordering. Though it's late and I
> > > might have missed some memory barriers which would prevent this
> > > scenario...
> >
> > So, please bear with me, if it's a stupid question. But don't all
> > locks have implicit ACQUIRE and RELEASE semantics already? At least
> > that's my reading of Documentation/memory-barriers.txt.
> >
> > So with that, wouldn't it be OK to just change
> > READ_ONCE(vma->vm_lock_seq) to smp_load_acquire(&vma->vm_lock_seq) and
> > mitigate the issue you pointed out?
> >
> >
> > So maybe something like below:
> >
> > rcu_read_lock()
> >
> > vma = find_vma(...)
> > if (!vma) /* bail */
> >
> > vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
> > mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> > /* I think vm_lock has to be acquired first to avoid the race */
> > if (mm_lock_seq == vm_lock_seq)
> >     /* bail, vma is write-locked */
> >
> > ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
> >
> > if (smp_load_acquire(&vma->vm_lock_seq) != vm_lock_seq)
> >     /* bail, VMA might have changed */
> >
> > Thoughts?
>
> Hi Andrii,
> I've prepared a quick patch following Peter's suggestion in [1] to
> make mm->mm_lock_seq a proper seqcount. I'll post it shortly as RFC so
> you can try it out. I think that would be a much cleaner solution.
> I'll post a link to it shortly.

The RFC is posted at
https://lore.kernel.org/all/20240807182325.2585582-1-surenb@xxxxxxxxxx/.
With that patch you can do:

bool success = false;
int seq;

if (!mmap_lock_speculation_start(mm, &seq)) /* bail out */

rcu_read_lock()
vma = find_vma(...)
if (!vma) /* rcu_read_unlock and bail out */
/* obtain vma->vm_file->f_inode */
rcu_read_unlock();

if (!mmap_lock_speculation_end(mm, seq)) /* bail out */

> Thanks,
> Suren.
>
> [1] https://lore.kernel.org/all/20240730134605.GO33588@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
>
> >
> > >
> > > > This check means that there was no call to
> > > > mmap_write_unlock()/mmap_write_downgrade() since
> > > > mm_start_vma_speculation() and the vma is not currently locked. To
> > > > unlock a write-locked VMA you do need to call
> > > > map_write_unlock()/mmap_write_downgrade(), so I think this check would
> > > > guarantee that your vma was not locked and modified from under us.
> > > > However this will also trigger false positives if
> > > > mmap_write_unlock()/mmap_write_downgrade() was called but the vma you
> > > > are using was never locked. So, it will bail out more than necessary.
> > > > Maybe it's ok?
> > > >
> > > > > +}
> > > > > +
> >
> > [...]