On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> Another big bottleneck to scalability is uprobe_treelock that's taken in
> a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
> take advantage of that and make uprobes_tree RB-tree look up lockless.
>
> To make RB-tree RCU-protected lockless lookup correct, we need to take
> into account that such RB-tree lookup can return false negatives if there
> are parallel RB-tree modifications (rotations) going on. We use a seqcount
> lock to detect whether the RB-tree changed, and if we find nothing while
> the RB-tree got modified in between, we just retry. If the uprobe was found,
> then it's guaranteed to be a correct lookup.
>
> With all the lock-avoiding changes done, we get a pretty decent
> improvement in performance and scalability of uprobes with the number of
> CPUs, even though we are still nowhere near linear scalability. This is
> due to SRCU not really scaling very well with the number of CPUs on
> the particular hardware used for testing (80-core Intel Xeon Gold
> 6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
> currently taken to resolve the interrupt address to inode+offset and then
> to the uprobe instance. And, of course, uretprobes still need similar RCU to
> avoid refcounting in the hot path, which will be addressed in follow-up
> patches.
>
> Nevertheless, the improvement is good. We used BPF selftest-based
> uprobe-nop and uretprobe-nop benchmarks to get the numbers below,
> varying the number of CPUs on which uprobes and uretprobes are triggered.
>
> BASELINE
> ========
> uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  ( 3.032M/s/cpu)
> uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  ( 1.726M/s/cpu)
> uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  ( 0.916M/s/cpu)
> uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  ( 0.465M/s/cpu)
> uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  ( 0.209M/s/cpu)
> uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  ( 0.071M/s/cpu)
> uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  ( 0.050M/s/cpu)
>
> uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  ( 1.979M/s/cpu)
> uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  ( 1.180M/s/cpu)
> uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  ( 0.577M/s/cpu)
> uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  ( 0.282M/s/cpu)
> uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  ( 0.125M/s/cpu)
> uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  ( 0.051M/s/cpu)
> uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  ( 0.034M/s/cpu)
>
> SRCU CHANGES
> ============
> uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  ( 3.276M/s/cpu)
> uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  ( 2.063M/s/cpu)
> uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  ( 1.928M/s/cpu)
> uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  ( 1.012M/s/cpu)
> uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  ( 0.406M/s/cpu)
> uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  ( 0.137M/s/cpu)
> uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  ( 0.101M/s/cpu)
>
> uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  ( 2.055M/s/cpu)
> uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  ( 1.339M/s/cpu)
> uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  ( 1.140M/s/cpu)
> uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  ( 0.661M/s/cpu)
> uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  ( 0.317M/s/cpu)
> uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  ( 0.113M/s/cpu)
> uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  ( 0.058M/s/cpu)
>
> Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
> 8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
> to 5 mln/s.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> ---
>  kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
>  1 file changed, 24 insertions(+), 6 deletions(-)
>

Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not enough
or are buggy. I managed to more or less reliably reproduce a crash,
which was bisected to exactly this change. My wild guess is that we'd
need an rb_erase_rcu() variant or something, because what I'm seeing is
a corrupted rb_root node while performing a lockless rb_find_rcu()
lookup.

You can find the debugging info below (which Gmail will butcher here)
in [0], for convenience.

So, I got the below crash when running `sudo ./uprobe-stress -a20 -f3 -m5 -t4`
on a 16-core QEMU VM.

[ 179.375551] BUG: unable to handle page fault for address: 0000441f0f660097
[ 179.376612] #PF: supervisor read access in kernel mode
[ 179.377334] #PF: error_code(0x0000) - not-present page
[ 179.378037] PGD 0 P4D 0
[ 179.378391] Oops: Oops: 0000 [#1] PREEMPT SMP
[ 179.378992] CPU: 5 UID: 0 PID: 2292 Comm: uprobe-stress Tainted: G E 6.11.0-rc1-00025-g6f8e0d8d5b55-dirty #181
[ 179.380514] Tainted: [E]=UNSIGNED_MODULE
[ 179.381022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 179.382475] RIP: 0010:uprobe_notify_resume+0x3db/0xc40
[ 179.383148] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
[ 179.385639] RSP: 0000:ffffc90000a93e78 EFLAGS: 00010202
[ 179.386338] RAX: 74c2394808478d48 RBX: ffff888105619800 RCX: 0000000000004118
[ 179.387480] RDX: ffff888109c18eb0 RSI: 00000000000a6130 RDI: ffff888105619800
[ 179.388677] RBP: 0000441f0f660007 R08: ffff8881098f1300 R09: 0000000000000000
[ 179.389729] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888105680fc0
[ 179.390694] R13: 0000000000000000 R14: ffff888105681068 R15: ffff888105681000
[ 179.391717] FS: 00007f10690006c0(0000) GS:ffff88881fb40000(0000) knlGS:0000000000000000
[ 179.392800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 179.393582] CR2: 0000441f0f660097 CR3: 00000001049bb004 CR4: 0000000000370ef0
[ 179.394536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 179.395485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 179.396464] Call Trace:
[ 179.396796]  <TASK>
[ 179.397133]  ? __die+0x1f/0x60
[ 179.397640]  ? page_fault_oops+0x14c/0x440
[ 179.398316]  ? do_user_addr_fault+0x5f/0x6c0
[ 179.398899]  ? kvm_read_and_reset_apf_flags+0x3c/0x50
[ 179.399730]  ? exc_page_fault+0x66/0x130
[ 179.400361]  ? asm_exc_page_fault+0x22/0x30
[ 179.400912]  ? uprobe_notify_resume+0x3db/0xc40
[ 179.401500]  ? uprobe_notify_resume+0x11a/0xc40
[ 179.402089]  ? arch_uprobe_exception_notify+0x39/0x40
[ 179.402736]  ? notifier_call_chain+0x55/0xc0
[ 179.403293]  irqentry_exit_to_user_mode+0x98/0x140
[ 179.403917]  asm_exc_int3+0x35/0x40
[ 179.404394] RIP: 0033:0x404119
[ 179.404831] Code: 8b 45 fc 89 c7 e8 43 09 00 00 83 c0 01 c9 c3 55 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 29 09 00 00 83 c0 01 c9 c3 cc <48> 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 0f 09 00 00 83 c0
[ 179.407350] RSP: 002b:00007f1068fffab8 EFLAGS: 00000206
[ 179.408083] RAX: 0000000000404118 RBX: 00007f1069000cdc RCX: 000000000000000a
[ 179.409020] RDX: 0000000000000012 RSI: 0000000000000064 RDI: 0000000000000012
[ 179.409965] RBP: 00007f1068fffae0 R08: 00007f106a84403c R09: 00007f106a8440a0
[ 179.410864] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff80
[ 179.411768] R13: 0000000000000016 R14: 00007ffc3dc8f070 R15: 00007f1068800000
[ 179.412672]  </TASK>
[ 179.412965] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) i2c_piix4(E) i2c_smbus(E) i2c_core(E) i6300esb(E) crc32c_intel(E) floppy(E) pcspkr(E) button(E) serio_raw(E)
[ 179.415227] CR2: 0000441f0f660097
[ 179.415683] ---[ end trace 0000000000000000 ]---

Note RBP: 0000441f0f660007 and RIP: 0010:uprobe_notify_resume+0x3db/0xc40.

Decoding:

[ 179.417075] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00

All code
========
   0:   c1 e1 0c                shl    $0xc,%ecx
   3:   48 2b 08                sub    (%rax),%rcx
   6:   48 8b 44 24 08          mov    0x8(%rsp),%rax
   b:   48 01 c1                add    %rax,%rcx
   e:   8b 35 bb 36 0d 03       mov    0x30d36bb(%rip),%esi        # 0x30d36cf
  14:   40 f6 c6 01             test   $0x1,%sil
  18:   0f 85 be 00 00 00       jne    0xdc
  1e:   48 8b 2d ba 36 0d 03    mov    0x30d36ba(%rip),%rbp        # 0x30d36df
  25:   48 85 ed                test   %rbp,%rbp
  28:   74 29                   je     0x53
  2a:*  48 8b 85 90 00 00 00    mov    0x90(%rbp),%rax             <-- trapping instruction
  31:   48 39 c2                cmp    %rax,%rdx
  34:   72 2d                   jb     0x63
  36:   48 39 d0                cmp    %rdx,%rax
  39:   72 0f                   jb     0x4a
  3b:   48                      rex.W
  3c:   3b                      .byte 0x3b
  3d:   8d                      .byte 0x8d
  3e:   98                      cwtl
        ...

Let's also look at uprobe_notify_resume+0x3db:

(gdb) list *(uprobe_notify_resume+0x3db)
0xffffffff81242f5b is in uprobe_notify_resume (/data/users/andriin/linux/kernel/events/uprobes.c:662).
657
658     static __always_inline
659     int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
660                    const struct uprobe *r)
661     {
662             if (l_inode < r->inode)
663                     return -1;
664
665             if (l_inode > r->inode)
666                     return 1;

So this is most probably when accessing the uprobe through r->inode.
Looking at the uprobe_notify_resume disassembly:

   0xffffffff81242f2c <+940>:   mov    0x78(%rax),%rcx
   0xffffffff81242f30 <+944>:   shl    $0xc,%rcx
   0xffffffff81242f34 <+948>:   sub    (%rax),%rcx
   0xffffffff81242f37 <+951>:   mov    0x8(%rsp),%rax
   0xffffffff81242f3c <+956>:   add    %rax,%rcx
   0xffffffff81242f3f <+959>:   mov    0x30d36bb(%rip),%esi        # 0xffffffff84316600 <uprobes_seqcount>
   0xffffffff81242f45 <+965>:   test   $0x1,%sil
   0xffffffff81242f49 <+969>:   jne    0xffffffff8124300d <uprobe_notify_resume+1165>
   0xffffffff81242f4f <+975>:   mov    0x30d36ba(%rip),%rbp        # 0xffffffff84316610 <uprobes_tree>
   0xffffffff81242f56 <+982>:   test   %rbp,%rbp
   0xffffffff81242f59 <+985>:   je     0xffffffff81242f84 <uprobe_notify_resume+1028>
   0xffffffff81242f5b <+987>:   mov    0x90(%rbp),%rax
   0xffffffff81242f62 <+994>:   cmp    %rax,%rdx
   0xffffffff81242f65 <+997>:   jb     0xffffffff81242f94 <uprobe_notify_resume+1044>
   0xffffffff81242f67 <+999>:   cmp    %rdx,%rax
   0xffffffff81242f6a <+1002>:  jb     0xffffffff81242f7b <uprobe_notify_resume+1019>
   0xffffffff81242f6c <+1004>:  cmp    0x98(%rbp),%rcx
   0xffffffff81242f73 <+1011>:  jl     0xffffffff81242f94 <uprobe_notify_resume+1044>
   0xffffffff81242f75 <+1013>:  jle    0xffffffff81242d0e <uprobe_notify_resume+398>
   0xffffffff81242f7b <+1019>:  mov    0x8(%rbp),%rbp
   0xffffffff81242f7f <+1023>:  test   %rbp,%rbp
   0xffffffff81242f25 <+933>:   mov    0xa8(%rcx),%rdx

Note how in

   mov    0x90(%rbp),%rax

%rbp is coming from

   mov    0x30d36ba(%rip),%rbp        # 0xffffffff84316610 <uprobes_tree>

So this is the inlined uprobe_cmp() on uprobes_tree.rb_node (i.e., the root of
the RB tree). And we know that %rbp was set to 0x0000441f0f660007, which
doesn't look like a valid kernel address to me (and indeed 0x0000441f0f660007
plus 0x90 is exactly the faulting address 0x0000441f0f660097 reported in CR2).

0x90 is offsetof(struct uprobe, inode), and offsetof(struct uprobe, rb_node) == 0.
So we load the rb_node/uprobe pointer from uprobes_tree (struct rb_root), get
0x0000441f0f660007, and then r->inode crashes the kernel because r looks like a
user space address.

Note that we never modify the RB tree root (or any node) directly anywhere. We
only mutate it with rb_find_add_rcu() and rb_erase(), both under
uprobes_treelock.

So, any ideas how we can end up with a "corrupted" root on a lockless lookup
with rb_find_rcu()? This seems to be the very first lockless RB-tree lookup use
case in the tree, so perhaps it's not that safe after all (certainly rb_erase()
is non-trivial enough not to be "obviously correct" w.r.t. RCU, no)?
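(Editorial aside: as a quick sanity check on the pointer arithmetic above -- the
values below are copied verbatim from the oops, the 0x90 offset comes from the
trapping `mov 0x90(%rbp),%rax` -- this trivial standalone program confirms that
the bogus node pointer plus the inode offset lands exactly on CR2. It is only an
illustration, not part of the report.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* Values copied from the oops above. */
        const uint64_t rbp = 0x0000441f0f660007ULL;   /* bogus rb_node/uprobe pointer from %rbp */
        const uint64_t cr2 = 0x0000441f0f660097ULL;   /* faulting address from CR2 */
        const uint64_t inode_off = 0x90;              /* offset used by the trapping r->inode load */

        /* The trapping instruction is `mov 0x90(%rbp),%rax`, i.e. the r->inode dereference. */
        printf("rbp + 0x90 = %#llx, CR2 = %#llx, match = %s\n",
               (unsigned long long)(rbp + inode_off),
               (unsigned long long)cr2,
               rbp + inode_off == cr2 ? "yes" : "no");
        return 0;
}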
[0] https://gist.github.com/anakryiko/df31ab75f25544af93cd41273056ee88

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index b0488d356399..d03962cc96de 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
>  #define no_uprobe_events()      RB_EMPTY_ROOT(&uprobes_tree)
>
>  static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
>  DEFINE_STATIC_SRCU(uprobes_srcu);
>
> @@ -629,8 +630,11 @@ static void put_uprobe(struct uprobe *uprobe)
>
>          write_lock(&uprobes_treelock);
>
> -        if (uprobe_is_active(uprobe))
> +        if (uprobe_is_active(uprobe)) {
> +                write_seqcount_begin(&uprobes_seqcount);
>                  rb_erase(&uprobe->rb_node, &uprobes_tree);
> +                write_seqcount_end(&uprobes_seqcount);
> +        }
>
>          write_unlock(&uprobes_treelock);
>
> @@ -696,14 +700,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
>                  .offset = offset,
>          };
>          struct rb_node *node;
> +        unsigned int seq;
>
>          lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
>
> -        read_lock(&uprobes_treelock);
> -        node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> -        read_unlock(&uprobes_treelock);
> +        do {
> +                seq = read_seqcount_begin(&uprobes_seqcount);
> +                node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
> +                /*
> +                 * Lockless RB-tree lookups can result only in false negatives.
> +                 * If the element is found, it is correct and can be returned
> +                 * under RCU protection. If we find nothing, we need to
> +                 * validate that seqcount didn't change. If it did, we have to
> +                 * try again as we might have missed the element (false
> +                 * negative). If seqcount is unchanged, search truly failed.
> +                 */
> +                if (node)
> +                        return __node_2_uprobe(node);
> +        } while (read_seqcount_retry(&uprobes_seqcount, seq));
>
> -        return node ? __node_2_uprobe(node) : NULL;
> +        return NULL;
>  }
>
>  /*
> @@ -725,7 +741,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
>  {
>          struct rb_node *node;
>  again:
> -        node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> +        node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
>          if (node) {
>                  struct uprobe *u = __node_2_uprobe(node);
>
> @@ -750,7 +766,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>          struct uprobe *u;
>
>          write_lock(&uprobes_treelock);
> +        write_seqcount_begin(&uprobes_seqcount);
>          u = __insert_uprobe(uprobe);
> +        write_seqcount_end(&uprobes_seqcount);
>          write_unlock(&uprobes_treelock);
>
>          return u;
> --
> 2.43.0
>
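(Editorial aside: for readers less familiar with the pattern, here is a minimal
userspace sketch of the seqcount retry idea that find_uprobe_rcu() relies on in
the diff above. It uses C11 atomics and made-up names -- writer_update(),
reader_lookup(), lookup_seq, tree_root -- and deliberately does not use the
kernel seqcount API; the real code uses read_seqcount_begin()/
read_seqcount_retry(), with writers serialized by uprobes_treelock.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_uint lookup_seq;            /* stand-in for uprobes_seqcount */
static _Atomic long tree_root = 42;       /* stand-in for the RB-tree (a single "node" value) */

/* Writer side: sequence is odd while the structure is being modified.
 * In the kernel, writers are additionally serialized by uprobes_treelock. */
static void writer_update(long new_root)
{
        atomic_fetch_add_explicit(&lookup_seq, 1, memory_order_release); /* -> odd */
        atomic_store_explicit(&tree_root, new_root, memory_order_release);
        atomic_fetch_add_explicit(&lookup_seq, 1, memory_order_release); /* -> even */
}

/* Reader side: a hit is returned right away; a miss is only trusted if the
 * sequence did not change while we were looking (otherwise retry). */
static bool reader_lookup(long key, long *out)
{
        for (;;) {
                unsigned int seq = atomic_load_explicit(&lookup_seq, memory_order_acquire);

                if (seq & 1)
                        continue;               /* writer in progress, try again */

                long val = atomic_load_explicit(&tree_root, memory_order_acquire);
                if (val == key) {
                        *out = val;             /* "found" result is trusted as-is */
                        return true;
                }

                if (atomic_load_explicit(&lookup_seq, memory_order_acquire) == seq)
                        return false;           /* validated miss */
        }
}

int main(void)
{
        long v;

        printf("lookup(42) -> %d\n", reader_lookup(42, &v));  /* hit, no retry needed */
        writer_update(7);
        printf("lookup(42) -> %d\n", reader_lookup(42, &v));  /* validated miss */
        return 0;
}

Note that in this scheme a "found" node is trusted and dereferenced before any
retry check, so it depends entirely on whatever pointers a reader can observe
mid-update still pointing at valid (here, SRCU-protected) memory -- which is
exactly the property being questioned for rb_erase() in the report above.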