On Tue, Aug 6, 2024 at 6:36 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > On Mon, Aug 5, 2024 at 9:08 PM Andrii Nakryiko > <andrii.nakryiko@xxxxxxxxx> wrote: > > > > On Sun, Aug 4, 2024 at 4:22 PM Andrii Nakryiko > > <andrii.nakryiko@xxxxxxxxx> wrote: > > > > > > On Sat, Aug 3, 2024 at 1:53 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > > > > > On Fri, Aug 02, 2024 at 10:47:15PM -0700, Andrii Nakryiko wrote: > > > > > > > > > Is there any reason why the approach below won't work? > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > index 8be9e34e786a..e21b68a39f13 100644 > > > > > --- a/kernel/events/uprobes.c > > > > > +++ b/kernel/events/uprobes.c > > > > > @@ -2251,6 +2251,52 @@ static struct uprobe > > > > > *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb > > > > > struct uprobe *uprobe = NULL; > > > > > struct vm_area_struct *vma; > > > > > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > > > + vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags; > > > > > + struct file *vm_file; > > > > > + struct inode *vm_inode; > > > > > + unsigned long vm_pgoff, vm_start, vm_end; > > > > > + int vm_lock_seq; > > > > > + loff_t offset; > > > > > + > > > > > + rcu_read_lock(); > > > > > + > > > > > + vma = vma_lookup(mm, bp_vaddr); > > > > > + if (!vma) > > > > > + goto retry_with_lock; > > > > > + > > > > > + vm_lock_seq = READ_ONCE(vma->vm_lock_seq); > > > > > > > > So vma->vm_lock_seq is only updated on vma_start_write() > > > > > > yep, I've looked a bit more at the implementation now > > > > > > > > > > > > + > > > > > + vm_file = READ_ONCE(vma->vm_file); > > > > > + vm_flags = READ_ONCE(vma->vm_flags); > > > > > + if (!vm_file || (vm_flags & flags) != VM_MAYEXEC) > > > > > + goto retry_with_lock; > > > > > + > > > > > + vm_inode = READ_ONCE(vm_file->f_inode); > > > > > + vm_pgoff = READ_ONCE(vma->vm_pgoff); > > > > > + vm_start = READ_ONCE(vma->vm_start); > > > > > + vm_end = READ_ONCE(vma->vm_end); > > > > > > > > None of those are written with WRITE_ONCE(), so this buys you nothing. > > > > Compiler could be updating them one byte at a time while you load some > > > > franken-update. > > > > > > > > Also, if you're in the middle of split_vma() you might not get a > > > > consistent set. > > > > > > I used READ_ONCE() only to prevent the compiler from re-reading those > > > values. We assume those values are garbage anyways and double-check > > > everything, so lack of WRITE_ONCE doesn't matter. Same for > > > inconsistency if we are in the middle of split_vma(). > > > > > > We use the result of all this speculative calculation only if we find > > > a valid uprobe (which could be a false positive) *and* if we detect > > > that nothing about VMA changed (which is what I got wrong, but > > > honestly I was actually betting on others to help me get this right > > > anyways). > > > > > > > > > > > > + if (bp_vaddr < vm_start || bp_vaddr >= vm_end) > > > > > + goto retry_with_lock; > > > > > + > > > > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start); > > > > > + uprobe = find_uprobe_rcu(vm_inode, offset); > > > > > + if (!uprobe) > > > > > + goto retry_with_lock; > > > > > + > > > > > + /* now double check that nothing about VMA changed */ > > > > > + if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq)) > > > > > + goto retry_with_lock; > > > > > > > > Since vma->vma_lock_seq is only ever updated at vma_start_write() you're > > > > checking you're in or after the same modification cycle. > > > > > > > > The point of sequence locks is to check you *IN* a modification cycle > > > > and retry if you are. You're now explicitly continuing if you're in a > > > > modification. > > > > > > > > You really need: > > > > > > > > seq++; > > > > wmb(); > > > > > > > > ... do modification > > > > > > > > wmb(); > > > > seq++; > > > > > > > > vs > > > > > > > > do { > > > > s = READ_ONCE(seq) & ~1; > > > > rmb(); > > > > > > > > ... read stuff > > > > > > > > } while (rmb(), seq != s); > > > > > > > > > > > > The thing to note is that seq will be odd while inside a modification > > > > and even outside, further if the pre and post seq are both even but not > > > > identical, you've crossed a modification and also need to retry. > > > > > > > > > > Ok, I don't think I got everything you have written above, sorry. But > > > let me explain what I think I need to do and please correct what I > > > (still) got wrong. > > > > > > a) before starting speculation, > > > a.1) read and remember current->mm->mm_lock_seq (using > > > smp_load_acquire(), right?) > > > a.2) read vma->vm_lock_seq (using smp_load_acquire() I presume) > > > a.3) if vm_lock_seq is odd, we are already modifying VMA, so bail > > > out, try with proper mmap_lock > > > b) proceed with the inode pointer fetch and offset calculation as I've coded it > > > c) lookup uprobe by inode+offset, if failed -- bail out (if succeeded, > > > this could still be wrong) > > > d) re-read vma->vm_lock_seq, if it changed, we started modifying/have > > > already modified VMA, bail out > > > e) re-read mm->mm_lock_seq, if that changed -- presume VMA got > > > modified, bail out > > > > > > At this point we should have a guarantee that nothing about mm > > > changed, nor that VMA started being modified during our speculative > > > calculation+uprobe lookup. So if we found a valid uprobe, it must be a > > > correct one that we need. > > > > > > Is that enough? Any holes in the approach? And thanks for thoroughly > > > thinking about this, btw! > > > > Ok, with slight modifications to the details of the above (e.g., there > > is actually no "odd means VMA is being modified" thing with > > vm_lock_seq), > > Correct. Instead of that (vm_lock_seq->vm_lock_seq == mm->mm_lock_seq) > means your VMA is write-locked and is being modified. > > > I ended up with the implementation below. Basically we > > validate that mm->mm_lock_seq didn't change and that vm_lock_seq != > > mm_lock_seq (which otherwise would mean "VMA is being modified"). > > Validating that mm->mm_lock_seq did not change does not provide you > with useful information. It only means that between the point where > you recorded mm->mm_lock_seq and where you are checking it, there was > an mmap_write_unlock() or mmap_write_downgrade() call. Your VMA might > not have even been part of that modification for which mmap_lock was > taken. > > In theory what you need is simpler (simplified code for explanation only): > > int vm_lock_seq = vma->vm_lock_seq; > if (vm_lock_seq == mm->mm_lock_seq) > goto bail_out; /* VMA is write-locked */ > > /* copy required VMA attributes */ > > if (vm_lock_seq != vma->vm_lock_seq) > goto bail_out; /* VMA got write-locked */ > > But this would require proper ACQUIRE/RELEASE semantics for > vma->vm_lock_seq which is currently not there because all reads/writes > to vma->vm_lock_seq that matter are done under vma->vm_lock->lock > protection, so additional ordering is not required. If you decide to > add that semantics for vma->vm_lock_seq, please make sure that > pagefault path performance does not regress. > > > There is a possibility that vm_lock_seq == mm_lock_seq just by > > accident, which is not a correctness problem, we'll just fallback to > > locked implementation until something about VMA or mm_struct itself > > changes. Which is fine, and if mm folks ever change this locking > > schema, this might go away. > > > > If this seems on the right track, I think we can just move > > mm_start_vma_specuation()/mm_end_vma_speculation() into > > include/linux/mm.h. > > > > And after thinking a bit more about READ_ONCE() usage, I changed them > > to data_race() to not trigger KCSAN warnings. Initially I kept > > READ_ONCE() only around vma->vm_file access, but given we never change > > it until vma is freed and reused (which would be prevented by > > guard(rcu)), I dropped READ_ONCE() and only added data_race(). And > > even data_race() is probably not necessary. > > > > Anyways, please see the patch below. Would be nice if mm folks > > (Suren?) could confirm that this is not broken. > > > > > > > > Author: Andrii Nakryiko <andrii@xxxxxxxxxx> > > Date: Fri Aug 2 22:16:40 2024 -0700 > > > > uprobes: add speculative lockless VMA to inode resolution > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 3de311c56d47..bee7a929ff02 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -2244,6 +2244,70 @@ static int is_trap_at_addr(struct mm_struct > > *mm, unsigned long vaddr) > > return is_trap_insn(&opcode); > > } > > > > +#ifdef CONFIG_PER_VMA_LOCK > > +static inline void mm_start_vma_speculation(struct mm_struct *mm, int > > *mm_lock_seq) > > +{ > > + *mm_lock_seq = smp_load_acquire(&mm->mm_lock_seq); > > +} > > + > > +/* returns true if speculation was safe (no mm and vma modification > > happened) */ > > +static inline bool mm_end_vma_speculation(struct vm_area_struct *vma, > > int mm_lock_seq) > > +{ > > + int mm_seq, vma_seq; > > + > > + mm_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq); > > + vma_seq = READ_ONCE(vma->vm_lock_seq); > > + > > + return mm_seq == mm_lock_seq && vma_seq != mm_seq; > > After spending some time on this I think what you do here is > semantically correct but sub-optimal. Actually, after staring at this code some more I think vma->vm_lock_seq not having proper ACQUIRE/RELEASE semantics would bite us here as well. The entire find_active_uprobe_speculative() might be executing while mmap_lock is write-locked (so, mm_seq == mm_lock_seq is satisfied) and we might miss that the VMA is locked due to vma->vm_lock_seq read/write reordering. Though it's late and I might have missed some memory barriers which would prevent this scenario... > This check means that there was no call to > mmap_write_unlock()/mmap_write_downgrade() since > mm_start_vma_speculation() and the vma is not currently locked. To > unlock a write-locked VMA you do need to call > map_write_unlock()/mmap_write_downgrade(), so I think this check would > guarantee that your vma was not locked and modified from under us. > However this will also trigger false positives if > mmap_write_unlock()/mmap_write_downgrade() was called but the vma you > are using was never locked. So, it will bail out more than necessary. > Maybe it's ok? > > > +} > > + > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr) > > +{ > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE; > > + struct mm_struct *mm = current->mm; > > + struct uprobe *uprobe; > > + struct vm_area_struct *vma; > > + struct file *vm_file; > > + struct inode *vm_inode; > > + unsigned long vm_pgoff, vm_start; > > + int mm_lock_seq; > > + loff_t offset; > > + > > + guard(rcu)(); > > + > > + mm_start_vma_speculation(mm, &mm_lock_seq); > > + > > + vma = vma_lookup(mm, bp_vaddr); > > + if (!vma) > > + return NULL; > > + > > + vm_file = data_race(vma->vm_file); > > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC) > > + return NULL; > > + > > + vm_inode = data_race(vm_file->f_inode); > > + vm_pgoff = data_race(vma->vm_pgoff); > > + vm_start = data_race(vma->vm_start); > > + > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start); > > + uprobe = find_uprobe_rcu(vm_inode, offset); > > + if (!uprobe) > > + return NULL; > > + > > + /* now double check that nothing about MM and VMA changed */ > > + if (!mm_end_vma_speculation(vma, mm_lock_seq)) > > + return NULL; > > + > > + /* happy case, we speculated successfully */ > > + return uprobe; > > +} > > +#else /* !CONFIG_PER_VMA_LOCK */ > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr) > > +{ > > + return NULL; > > +} > > +#endif /* CONFIG_PER_VMA_LOCK */ > > + > > /* assumes being inside RCU protected region */ > > static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, > > int *is_swbp) > > { > > @@ -2251,6 +2315,10 @@ static struct uprobe > > *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb > > struct uprobe *uprobe = NULL; > > struct vm_area_struct *vma; > > > > + uprobe = find_active_uprobe_speculative(bp_vaddr); > > + if (uprobe) > > + return uprobe; > > + > > mmap_read_lock(mm); > > vma = vma_lookup(mm, bp_vaddr); > > if (vma) { > > diff --git a/kernel/fork.c b/kernel/fork.c > > index cc760491f201..211a84ee92b4 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void) > > NULL); > > files_cachep = kmem_cache_create("files_cache", > > sizeof(struct files_struct), 0, > > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > > + > > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU, > > NULL); > > fs_cachep = kmem_cache_create("fs_cache", > > sizeof(struct fs_struct), 0, > > > > > > > > > > P.S. This is basically the last big blocker towards linear uprobes > > > scalability with the number of active CPUs. I have > > > uretprobe+SRCU+timeout implemented and it seems to work fine, will > > > post soon-ish. > > > > > > P.P.S Also, funny enough, below was another big scalability limiter > > > (and the last one) :) I'm not sure if we can just drop it, or I should > > > use per-CPU counter, but with the below change and speculative VMA > > > lookup (however buggy, works ok for benchmarking), I finally get > > > linear scaling of uprobe triggering throughput with number of CPUs. We > > > are very close. > > > > > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c > > > index f7443e996b1b..64c2bc316a08 100644 > > > --- a/kernel/trace/trace_uprobe.c > > > +++ b/kernel/trace/trace_uprobe.c > > > @@ -1508,7 +1508,7 @@ static int uprobe_dispatcher(struct > > > uprobe_consumer *con, struct pt_regs *regs) > > > int ret = 0; > > > > > > tu = container_of(con, struct trace_uprobe, consumer); > > > - tu->nhit++; > > > + //tu->nhit++; > > > > > > udd.tu = tu; > > > udd.bp_addr = instruction_pointer(regs); > > > > > > > > > > > + > > > > > + /* happy case, we speculated successfully */ > > > > > + rcu_read_unlock(); > > > > > + return uprobe; > > > > > + > > > > > +retry_with_lock: > > > > > + rcu_read_unlock(); > > > > > + uprobe = NULL; > > > > > +#endif > > > > > + > > > > > mmap_read_lock(mm); > > > > > vma = vma_lookup(mm, bp_vaddr); > > > > > if (vma) { > > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > > index cc760491f201..211a84ee92b4 100644 > > > > > --- a/kernel/fork.c > > > > > +++ b/kernel/fork.c > > > > > @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void) > > > > > NULL); > > > > > files_cachep = kmem_cache_create("files_cache", > > > > > sizeof(struct files_struct), 0, > > > > > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > > > > > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU, > > > > > NULL); > > > > > fs_cachep = kmem_cache_create("fs_cache", > > > > > sizeof(struct fs_struct), 0,