On Mon, Aug 5, 2024 at 9:08 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sun, Aug 4, 2024 at 4:22 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Sat, Aug 3, 2024 at 1:53 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Aug 02, 2024 at 10:47:15PM -0700, Andrii Nakryiko wrote:
> > >
> > > > Is there any reason why the approach below won't work?
> > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 8be9e34e786a..e21b68a39f13 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -2251,6 +2251,52 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> > > >         struct uprobe *uprobe = NULL;
> > > >         struct vm_area_struct *vma;
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +       vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
> > > > +       struct file *vm_file;
> > > > +       struct inode *vm_inode;
> > > > +       unsigned long vm_pgoff, vm_start, vm_end;
> > > > +       int vm_lock_seq;
> > > > +       loff_t offset;
> > > > +
> > > > +       rcu_read_lock();
> > > > +
> > > > +       vma = vma_lookup(mm, bp_vaddr);
> > > > +       if (!vma)
> > > > +               goto retry_with_lock;
> > > > +
> > > > +       vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
> > >
> > > So vma->vm_lock_seq is only updated on vma_start_write()
> >
> > yep, I've looked a bit more at the implementation now
> >
> > > > +
> > > > +       vm_file = READ_ONCE(vma->vm_file);
> > > > +       vm_flags = READ_ONCE(vma->vm_flags);
> > > > +       if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
> > > > +               goto retry_with_lock;
> > > > +
> > > > +       vm_inode = READ_ONCE(vm_file->f_inode);
> > > > +       vm_pgoff = READ_ONCE(vma->vm_pgoff);
> > > > +       vm_start = READ_ONCE(vma->vm_start);
> > > > +       vm_end = READ_ONCE(vma->vm_end);
> > >
> > > None of those are written with WRITE_ONCE(), so this buys you nothing.
> > > Compiler could be updating them one byte at a time while you load some
> > > franken-update.
> > >
> > > Also, if you're in the middle of split_vma() you might not get a
> > > consistent set.
> >
> > I used READ_ONCE() only to prevent the compiler from re-reading those
> > values. We assume those values are garbage anyways and double-check
> > everything, so lack of WRITE_ONCE doesn't matter. Same for
> > inconsistency if we are in the middle of split_vma().
> >
> > We use the result of all this speculative calculation only if we find
> > a valid uprobe (which could be a false positive) *and* if we detect
> > that nothing about VMA changed (which is what I got wrong, but
> > honestly I was actually betting on others to help me get this right
> > anyways).
> >
> > > > +       if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
> > > > +               goto retry_with_lock;
> > > > +
> > > > +       offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > +       uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > +       if (!uprobe)
> > > > +               goto retry_with_lock;
> > > > +
> > > > +       /* now double check that nothing about VMA changed */
> > > > +       if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
> > > > +               goto retry_with_lock;
> > >
> > > Since vma->vm_lock_seq is only ever updated at vma_start_write() you're
> > > checking you're in or after the same modification cycle.
> > >
> > > The point of sequence locks is to check you're *IN* a modification cycle
> > > and retry if you are. You're now explicitly continuing if you're in a
> > > modification.
> > >
> > > You really need:
> > >
> > >   seq++;
> > >   wmb();
> > >
> > >   ... do modification
> > >
> > >   wmb();
> > >   seq++;
> > >
> > > vs
> > >
> > >   do {
> > >           s = READ_ONCE(seq) & ~1;
> > >           rmb();
> > >
> > >           ... read stuff
> > >
> > >   } while (rmb(), seq != s);
> > >
> > >
> > > The thing to note is that seq will be odd while inside a modification
> > > and even outside, further if the pre and post seq are both even but not
> > > identical, you've crossed a modification and also need to retry.
> >
> > Ok, I don't think I got everything you have written above, sorry. But
> > let me explain what I think I need to do and please correct what I
> > (still) got wrong.
> >
> > a) before starting speculation,
> >   a.1) read and remember current->mm->mm_lock_seq (using
> > smp_load_acquire(), right?)
> >   a.2) read vma->vm_lock_seq (using smp_load_acquire() I presume)
> >   a.3) if vm_lock_seq is odd, we are already modifying VMA, so bail
> > out, try with proper mmap_lock
> > b) proceed with the inode pointer fetch and offset calculation as I've coded it
> > c) lookup uprobe by inode+offset, if failed -- bail out (if succeeded,
> > this could still be wrong)
> > d) re-read vma->vm_lock_seq, if it changed, we started modifying/have
> > already modified VMA, bail out
> > e) re-read mm->mm_lock_seq, if that changed -- presume VMA got
> > modified, bail out
> >
> > At this point we should have a guarantee that nothing about mm
> > changed, nor that VMA started being modified during our speculative
> > calculation+uprobe lookup. So if we found a valid uprobe, it must be a
> > correct one that we need.
> >
> > Is that enough? Any holes in the approach? And thanks for thoroughly
> > thinking about this, btw!
>
> Ok, with slight modifications to the details of the above (e.g., there
> is actually no "odd means VMA is being modified" thing with
> vm_lock_seq), I ended up with the implementation below. Basically we
> validate that mm->mm_lock_seq didn't change and that vm_lock_seq !=
> mm_lock_seq (which otherwise would mean "VMA is being modified").
> There is a possibility that vm_lock_seq == mm_lock_seq just by
> accident, which is not a correctness problem, we'll just fall back to
> the locked implementation until something about VMA or mm_struct itself
> changes. Which is fine, and if mm folks ever change this locking
> schema, this might go away.
>
> If this seems on the right track, I think we can just move
> mm_start_vma_speculation()/mm_end_vma_speculation() into
> include/linux/mm.h.
>
> And after thinking a bit more about READ_ONCE() usage, I changed them
> to data_race() to not trigger KCSAN warnings. Initially I kept
> READ_ONCE() only around vma->vm_file access, but given we never change
> it until vma is freed and reused (which would be prevented by
> guard(rcu)), I dropped READ_ONCE() and only added data_race(). And
> even data_race() is probably not necessary.
>
> Anyways, please see the patch below. Would be nice if mm folks
> (Suren?) could confirm that this is not broken.

Hi Andrii,
Sorry, I'm catching up on my emails and will get back to this to read
through the discussion later today. One obvious thing that is
problematic in this whole schema is the fact that vm_file is not
RCU-safe as I mentioned before. See below.
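As background on the pattern Peter describes above (an odd sequence count
means a writer is mid-update; any change between the reader's two samples
means the reader raced and must retry): it maps roughly onto the kernel's
seqcount_t helpers from <linux/seqlock.h> as sketched below. This is an
illustrative sketch only, not part of the patch; "struct obj" and its
functions are invented names, and the writer side must still be serialized
externally (e.g. under a spinlock).

#include <linux/seqlock.h>

struct obj {
        seqcount_t seq;
        unsigned long a, b;
};

static void obj_init(struct obj *o)
{
        seqcount_init(&o->seq);
        o->a = o->b = 0;
}

static void obj_update(struct obj *o, unsigned long a, unsigned long b)
{
        write_seqcount_begin(&o->seq);  /* seq becomes odd: modification in flight */
        o->a = a;
        o->b = b;
        write_seqcount_end(&o->seq);    /* seq becomes even again */
}

static void obj_read(struct obj *o, unsigned long *a, unsigned long *b)
{
        unsigned int seq;

        do {
                seq = read_seqcount_begin(&o->seq);     /* waits out an odd seq */
                *a = READ_ONCE(o->a);
                *b = READ_ONCE(o->b);
        } while (read_seqcount_retry(&o->seq, seq));    /* seq changed: raced, retry */
}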
>
>
>
> Author: Andrii Nakryiko <andrii@xxxxxxxxxx>
> Date:   Fri Aug 2 22:16:40 2024 -0700
>
>     uprobes: add speculative lockless VMA to inode resolution
>
>     Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 3de311c56d47..bee7a929ff02 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2244,6 +2244,70 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>         return is_trap_insn(&opcode);
>  }
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void mm_start_vma_speculation(struct mm_struct *mm, int *mm_lock_seq)
> +{
> +       *mm_lock_seq = smp_load_acquire(&mm->mm_lock_seq);
> +}
> +
> +/* returns true if speculation was safe (no mm and vma modification happened) */
> +static inline bool mm_end_vma_speculation(struct vm_area_struct *vma, int mm_lock_seq)
> +{
> +       int mm_seq, vma_seq;
> +
> +       mm_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq);
> +       vma_seq = READ_ONCE(vma->vm_lock_seq);
> +
> +       return mm_seq == mm_lock_seq && vma_seq != mm_seq;
> +}
> +
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +       const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> +       struct mm_struct *mm = current->mm;
> +       struct uprobe *uprobe;
> +       struct vm_area_struct *vma;
> +       struct file *vm_file;
> +       struct inode *vm_inode;
> +       unsigned long vm_pgoff, vm_start;
> +       int mm_lock_seq;
> +       loff_t offset;
> +
> +       guard(rcu)();
> +
> +       mm_start_vma_speculation(mm, &mm_lock_seq);
> +
> +       vma = vma_lookup(mm, bp_vaddr);
> +       if (!vma)
> +               return NULL;
> +
> +       vm_file = data_race(vma->vm_file);

Consider what happens if remove_vma() is racing with this code and at
this point it calls fput(vma->vm_file). Your vma will be valid (it's
freed after RCU grace period) but vma->vm_file can be freed from under
you. For this to work vma->vm_file should be made RCU-safe.
Thanks,
Suren.
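For background on the lifetime issue raised here: SLAB_TYPESAFE_BY_RCU
(which the quoted patch adds to a slab cache further down) does not pin a
particular object for RCU readers. It only guarantees the memory stays
allocated as an object of the same type while an RCU read-side section may
still hold a stale pointer; the object itself can be freed and immediately
reused. So a reader may safely dereference such a pointer, but must
re-validate what it found before trusting it. Below is a generic sketch of
that discipline with invented names ("struct thing", "thing_cachep",
"key"); it is not the actual uprobes fix.

#include <linux/slab.h>
#include <linux/rcupdate.h>

struct thing {
        unsigned long key;
};

static struct kmem_cache *thing_cachep;

static int thing_cache_init(void)
{
        /* freed objects may be reused right away, but their memory is not
         * returned to the page allocator while RCU readers can still see it */
        thing_cachep = kmem_cache_create("thing_cache", sizeof(struct thing),
                                         0, SLAB_TYPESAFE_BY_RCU, NULL);
        return thing_cachep ? 0 : -ENOMEM;
}

/* @slot is published/cleared elsewhere with rcu_assign_pointer() */
static bool thing_key_matches(struct thing __rcu **slot, unsigned long key)
{
        struct thing *t;
        bool match = false;

        rcu_read_lock();
        t = rcu_dereference(*slot);
        if (t) {
                /* safe to read even if t was just freed and recycled ... */
                unsigned long cur = READ_ONCE(t->key);

                /* ... but only believe it if the slot still points at t */
                if (cur == key && rcu_dereference(*slot) == t)
                        match = true;
        }
        rcu_read_unlock();
        return match;
}

On its own the flag only makes the dereference safe; some after-the-fact
identity check is still needed, which is what the mm_lock_seq validation in
the patch is trying to provide for the VMA side.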
> +       if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> +               return NULL;
> +
> +       vm_inode = data_race(vm_file->f_inode);
> +       vm_pgoff = data_race(vma->vm_pgoff);
> +       vm_start = data_race(vma->vm_start);
> +
> +       offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> +       uprobe = find_uprobe_rcu(vm_inode, offset);
> +       if (!uprobe)
> +               return NULL;
> +
> +       /* now double check that nothing about MM and VMA changed */
> +       if (!mm_end_vma_speculation(vma, mm_lock_seq))
> +               return NULL;
> +
> +       /* happy case, we speculated successfully */
> +       return uprobe;
> +}
> +#else /* !CONFIG_PER_VMA_LOCK */
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +       return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  /* assumes being inside RCU protected region */
>  static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
>  {
> @@ -2251,6 +2315,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
>         struct uprobe *uprobe = NULL;
>         struct vm_area_struct *vma;
>
> +       uprobe = find_active_uprobe_speculative(bp_vaddr);
> +       if (uprobe)
> +               return uprobe;
> +
>         mmap_read_lock(mm);
>         vma = vma_lookup(mm, bp_vaddr);
>         if (vma) {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cc760491f201..211a84ee92b4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
>                         NULL);
>         files_cachep = kmem_cache_create("files_cache",
>                         sizeof(struct files_struct), 0,
> -                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> +                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
>                         NULL);
>         fs_cachep = kmem_cache_create("fs_cache",
>                         sizeof(struct fs_struct), 0,
>
>
> > P.S. This is basically the last big blocker towards linear uprobes
> > scalability with the number of active CPUs. I have
> > uretprobe+SRCU+timeout implemented and it seems to work fine, will
> > post soon-ish.
> >
> > P.P.S Also, funny enough, below was another big scalability limiter
> > (and the last one) :) I'm not sure if we can just drop it, or I should
> > use per-CPU counter, but with the below change and speculative VMA
> > lookup (however buggy, works ok for benchmarking), I finally get
> > linear scaling of uprobe triggering throughput with number of CPUs. We
> > are very close.
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index f7443e996b1b..64c2bc316a08 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1508,7 +1508,7 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> >         int ret = 0;
> >
> >         tu = container_of(con, struct trace_uprobe, consumer);
> > -       tu->nhit++;
> > +       //tu->nhit++;
> >
> >         udd.tu = tu;
> >         udd.bp_addr = instruction_pointer(regs);
> >
> >
> > > > +
> > > > +       /* happy case, we speculated successfully */
> > > > +       rcu_read_unlock();
> > > > +       return uprobe;
> > > > +
> > > > +retry_with_lock:
> > > > +       rcu_read_unlock();
> > > > +       uprobe = NULL;
> > > > +#endif
> > > > +
> > > >         mmap_read_lock(mm);
> > > >         vma = vma_lookup(mm, bp_vaddr);
> > > >         if (vma) {
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index cc760491f201..211a84ee92b4 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
> > > >                         NULL);
> > > >         files_cachep = kmem_cache_create("files_cache",
> > > >                         sizeof(struct files_struct), 0,
> > > > -                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > > > +                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
> > > >                         NULL);
> > > >         fs_cachep = kmem_cache_create("fs_cache",
> > > >                         sizeof(struct fs_struct), 0,
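On the per-CPU counter idea in the quoted P.P.S.: rather than dropping
tu->nhit, the hit count can be kept in per-CPU storage so the hot path
never writes a shared cache line, and the per-CPU values are only folded
together when the count is actually read. A rough sketch with invented
names ("struct trace_uprobe_counters", "nhits"), not the actual
trace_uprobe layout:

#include <linux/percpu.h>
#include <linux/cpumask.h>

struct trace_uprobe_counters {
        unsigned long __percpu *nhits;
};

static int tu_counters_init(struct trace_uprobe_counters *c)
{
        c->nhits = alloc_percpu(unsigned long);
        return c->nhits ? 0 : -ENOMEM;
        /* free_percpu(c->nhits) on teardown */
}

/* hot path: each CPU bumps its own copy, no shared cache line is written */
static void tu_counters_hit(struct trace_uprobe_counters *c)
{
        this_cpu_inc(*c->nhits);
}

/* slow path: fold the per-CPU values only when the count is read */
static unsigned long tu_counters_sum(struct trace_uprobe_counters *c)
{
        unsigned long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += *per_cpu_ptr(c->nhits, cpu);
        return sum;
}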