On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> Another big bottleneck to scalability is uprobe_treelock that's taken in
> a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
> take advantage of that and make uprobes_tree RB-tree look up lockless.
>
> To make RB-tree RCU-protected lockless lookup correct, we need to take
> into account that such RB-tree lookup can return false negatives if there
> are parallel RB-tree modifications (rotations) going on. We use a seqcount
> lock to detect whether the RB-tree changed, and if we find nothing while
> the RB-tree got modified in between, we just retry. If the uprobe was found,
> then it's guaranteed to be a correct lookup.
>
> With all the lock-avoiding changes done, we get a pretty decent
> improvement in performance and scalability of uprobes with the number of
> CPUs, even though we are still nowhere near linear scalability. This is
> due to SRCU not really scaling very well with the number of CPUs on
> the particular hardware used for testing (80-core Intel Xeon Gold
> 6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
> currently taken to resolve the interrupt address to inode+offset and then
> to the uprobe instance. And, of course, uretprobes still need similar RCU to
> avoid refcounting in the hot path, which will be addressed in follow-up
> patches.
>
> Nevertheless, the improvement is good. We used BPF selftest-based
> uprobe-nop and uretprobe-nop benchmarks to get the numbers below,
> varying the number of CPUs on which uprobes and uretprobes are triggered.
>
> BASELINE
> ========
> uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  ( 3.032M/s/cpu)
> uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  ( 1.726M/s/cpu)
> uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  ( 0.916M/s/cpu)
> uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  ( 0.465M/s/cpu)
> uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  ( 0.209M/s/cpu)
> uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  ( 0.071M/s/cpu)
> uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  ( 0.050M/s/cpu)
>
> uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  ( 1.979M/s/cpu)
> uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  ( 1.180M/s/cpu)
> uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  ( 0.577M/s/cpu)
> uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  ( 0.282M/s/cpu)
> uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  ( 0.125M/s/cpu)
> uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  ( 0.051M/s/cpu)
> uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  ( 0.034M/s/cpu)
>
> SRCU CHANGES
> ============
> uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  ( 3.276M/s/cpu)
> uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  ( 2.063M/s/cpu)
> uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  ( 1.928M/s/cpu)
> uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  ( 1.012M/s/cpu)
> uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  ( 0.406M/s/cpu)
> uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  ( 0.137M/s/cpu)
> uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  ( 0.101M/s/cpu)
>
> uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  ( 2.055M/s/cpu)
> uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  ( 1.339M/s/cpu)
> uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  ( 1.140M/s/cpu)
> uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  ( 0.661M/s/cpu)
> uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  ( 0.317M/s/cpu)
> uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  ( 0.113M/s/cpu)
> uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  ( 0.058M/s/cpu)
>
> Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
> 8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
> to 5 mln/s.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> ---
>  kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
>  1 file changed, 24 insertions(+), 6 deletions(-)
>

Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not enough
or are buggy. I managed to more or less reliably reproduce a crash,
which was bisected to exactly this change. My wild guess is that we'd
need an rb_erase_rcu() variant or something, because what I'm seeing is
a corrupted rb_root node while performing a lockless rb_find_rcu()
lookup.

You can find the debugging info below (which Gmail will butcher here)
in [0], for convenience.

So, I got the below crash when running `sudo ./uprobe-stress -a20 -f3 -m5 -t4`
on a 16-core QEMU VM.

[ 179.375551] BUG: unable to handle page fault for address: 0000441f0f660097
[ 179.376612] #PF: supervisor read access in kernel mode
[ 179.377334] #PF: error_code(0x0000) - not-present page
[ 179.378037] PGD 0 P4D 0
[ 179.378391] Oops: Oops: 0000 [#1] PREEMPT SMP
[ 179.378992] CPU: 5 UID: 0 PID: 2292 Comm: uprobe-stress Tainted: G E 6.11.0-rc1-00025-g6f8e0d8d5b55-dirty #181
[ 179.380514] Tainted: [E]=UNSIGNED_MODULE
[ 179.381022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 179.382475] RIP: 0010:uprobe_notify_resume+0x3db/0xc40
[ 179.383148] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
[ 179.385639] RSP: 0000:ffffc90000a93e78 EFLAGS: 00010202
[ 179.386338] RAX: 74c2394808478d48 RBX: ffff888105619800 RCX: 0000000000004118
[ 179.387480] RDX: ffff888109c18eb0 RSI: 00000000000a6130 RDI: ffff888105619800
[ 179.388677] RBP: 0000441f0f660007 R08: ffff8881098f1300 R09: 0000000000000000
[ 179.389729] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888105680fc0
[ 179.390694] R13: 0000000000000000 R14: ffff888105681068 R15: ffff888105681000
[ 179.391717] FS: 00007f10690006c0(0000) GS:ffff88881fb40000(0000) knlGS:0000000000000000
[ 179.392800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 179.393582] CR2: 0000441f0f660097 CR3: 00000001049bb004 CR4: 0000000000370ef0
[ 179.394536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 179.395485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 179.396464] Call Trace:
[ 179.396796]  <TASK>
[ 179.397133]  ? __die+0x1f/0x60
[ 179.397640]  ? page_fault_oops+0x14c/0x440
[ 179.398316]  ? do_user_addr_fault+0x5f/0x6c0
[ 179.398899]  ? kvm_read_and_reset_apf_flags+0x3c/0x50
[ 179.399730]  ? exc_page_fault+0x66/0x130
[ 179.400361]  ? asm_exc_page_fault+0x22/0x30
[ 179.400912]  ? uprobe_notify_resume+0x3db/0xc40
[ 179.401500]  ? uprobe_notify_resume+0x11a/0xc40
[ 179.402089]  ? arch_uprobe_exception_notify+0x39/0x40
[ 179.402736]  ? notifier_call_chain+0x55/0xc0
[ 179.403293]  irqentry_exit_to_user_mode+0x98/0x140
[ 179.403917]  asm_exc_int3+0x35/0x40
[ 179.404394] RIP: 0033:0x404119
[ 179.404831] Code: 8b 45 fc 89 c7 e8 43 09 00 00 83 c0 01 c9 c3 55 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 29 09 00 00 83 c0 01 c9 c3 cc <48> 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 0f 09 00 00 83 c0
[ 179.407350] RSP: 002b:00007f1068fffab8 EFLAGS: 00000206
[ 179.408083] RAX: 0000000000404118 RBX: 00007f1069000cdc RCX: 000000000000000a
[ 179.409020] RDX: 0000000000000012 RSI: 0000000000000064 RDI: 0000000000000012
[ 179.409965] RBP: 00007f1068fffae0 R08: 00007f106a84403c R09: 00007f106a8440a0
[ 179.410864] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff80
[ 179.411768] R13: 0000000000000016 R14: 00007ffc3dc8f070 R15: 00007f1068800000
[ 179.412672]  </TASK>
[ 179.412965] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) i2c_piix4(E) i2c_smbus(E) i2c_core(E) i6300esb(E) crc32c_intel(E) floppy(E) pcspkr(E) button(E) serio_raw(E)
[ 179.415227] CR2: 0000441f0f660097
[ 179.415683] ---[ end trace 0000000000000000 ]---

Note RBP: 0000441f0f660007 and RIP: 0010:uprobe_notify_resume+0x3db/0xc40.

Decoding:

[ 179.417075] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00

All code
========
   0:   c1 e1 0c                shl    $0xc,%ecx
   3:   48 2b 08                sub    (%rax),%rcx
   6:   48 8b 44 24 08          mov    0x8(%rsp),%rax
   b:   48 01 c1                add    %rax,%rcx
   e:   8b 35 bb 36 0d 03       mov    0x30d36bb(%rip),%esi        # 0x30d36cf
  14:   40 f6 c6 01             test   $0x1,%sil
  18:   0f 85 be 00 00 00       jne    0xdc
  1e:   48 8b 2d ba 36 0d 03    mov    0x30d36ba(%rip),%rbp        # 0x30d36df
  25:   48 85 ed                test   %rbp,%rbp
  28:   74 29                   je     0x53
  2a:*  48 8b 85 90 00 00 00    mov    0x90(%rbp),%rax             <-- trapping instruction
  31:   48 39 c2                cmp    %rax,%rdx
  34:   72 2d                   jb     0x63
  36:   48 39 d0                cmp    %rdx,%rax
  39:   72 0f                   jb     0x4a
  3b:   48                      rex.W
  3c:   3b                      .byte 0x3b
  3d:   8d                      .byte 0x8d
  3e:   98                      cwtl
        ...

Let's also look at uprobe_notify_resume+0x3db:

(gdb) list *(uprobe_notify_resume+0x3db)
0xffffffff81242f5b is in uprobe_notify_resume (/data/users/andriin/linux/kernel/events/uprobes.c:662).
657
658     static __always_inline
659     int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
660                    const struct uprobe *r)
661     {
662             if (l_inode < r->inode)
663                     return -1;
664
665             if (l_inode > r->inode)
666                     return 1;

So this is most probably when accessing the uprobe through r->inode.
Looking at the uprobe_notify_resume disassembly:

   0xffffffff81242f2c <+940>:   mov    0x78(%rax),%rcx
   0xffffffff81242f30 <+944>:   shl    $0xc,%rcx
   0xffffffff81242f34 <+948>:   sub    (%rax),%rcx
   0xffffffff81242f37 <+951>:   mov    0x8(%rsp),%rax
   0xffffffff81242f3c <+956>:   add    %rax,%rcx
   0xffffffff81242f3f <+959>:   mov    0x30d36bb(%rip),%esi        # 0xffffffff84316600 <uprobes_seqcount>
   0xffffffff81242f45 <+965>:   test   $0x1,%sil
   0xffffffff81242f49 <+969>:   jne    0xffffffff8124300d <uprobe_notify_resume+1165>
   0xffffffff81242f4f <+975>:   mov    0x30d36ba(%rip),%rbp        # 0xffffffff84316610 <uprobes_tree>
   0xffffffff81242f56 <+982>:   test   %rbp,%rbp
   0xffffffff81242f59 <+985>:   je     0xffffffff81242f84 <uprobe_notify_resume+1028>
   0xffffffff81242f5b <+987>:   mov    0x90(%rbp),%rax
   0xffffffff81242f62 <+994>:   cmp    %rax,%rdx
   0xffffffff81242f65 <+997>:   jb     0xffffffff81242f94 <uprobe_notify_resume+1044>
   0xffffffff81242f67 <+999>:   cmp    %rdx,%rax
   0xffffffff81242f6a <+1002>:  jb     0xffffffff81242f7b <uprobe_notify_resume+1019>
   0xffffffff81242f6c <+1004>:  cmp    0x98(%rbp),%rcx
   0xffffffff81242f73 <+1011>:  jl     0xffffffff81242f94 <uprobe_notify_resume+1044>
   0xffffffff81242f75 <+1013>:  jle    0xffffffff81242d0e <uprobe_notify_resume+398>
   0xffffffff81242f7b <+1019>:  mov    0x8(%rbp),%rbp
   0xffffffff81242f7f <+1023>:  test   %rbp,%rbp
   0xffffffff81242f25 <+933>:   mov    0xa8(%rcx),%rdx

Note how in

   mov    0x90(%rbp),%rax

%rbp is coming from

   mov    0x30d36ba(%rip),%rbp        # 0xffffffff84316610 <uprobes_tree>

So this is the inlined uprobe_cmp() on uprobes_tree.rb_node (i.e., the root of
the RB tree). And we know that %rbp was set to 0x0000441f0f660007, which
doesn't look like a valid kernel address to me (and indeed 0x0000441f0f660007
plus 0x90 is exactly the faulting address 0x0000441f0f660097 reported in CR2).

0x90 is offsetof(struct uprobe, inode), and offsetof(struct uprobe, rb_node) == 0.
So we load the rb_node/uprobe pointer from uprobes_tree (struct rb_root), get
0x0000441f0f660007, and then r->inode crashes the kernel because r looks like a
user space address.

Note that we never modify the RB tree root (or any node) directly anywhere. We
only mutate it with rb_find_add_rcu() and rb_erase(), both under
uprobes_treelock.

So, any ideas how we can end up with a "corrupted" root on a lockless lookup
with rb_find_rcu()? This seems to be the very first lockless RB-tree lookup use
case in the tree, so perhaps it's not that safe after all (certainly rb_erase()
is non-trivial enough not to be "obviously correct" w.r.t. RCU, no)?
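(Editorial aside: as a quick sanity check on the pointer arithmetic above -- the
values below are copied verbatim from the oops, the 0x90 offset comes from the
trapping `mov 0x90(%rbp),%rax` -- this trivial standalone program confirms that
the bogus node pointer plus the inode offset lands exactly on CR2. It is only an
illustration, not part of the report.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* Values copied from the oops above. */
        const uint64_t rbp = 0x0000441f0f660007ULL;   /* bogus rb_node/uprobe pointer from %rbp */
        const uint64_t cr2 = 0x0000441f0f660097ULL;   /* faulting address from CR2 */
        const uint64_t inode_off = 0x90;              /* offset used by the trapping r->inode load */

        /* The trapping instruction is `mov 0x90(%rbp),%rax`, i.e. the r->inode dereference. */
        printf("rbp + 0x90 = %#llx, CR2 = %#llx, match = %s\n",
               (unsigned long long)(rbp + inode_off),
               (unsigned long long)cr2,
               rbp + inode_off == cr2 ? "yes" : "no");
        return 0;
}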
[0] https://gist.github.com/anakryiko/df31ab75f25544af93cd41273056ee88

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index b0488d356399..d03962cc96de 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
>  #define no_uprobe_events()      RB_EMPTY_ROOT(&uprobes_tree)
>
>  static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
>  DEFINE_STATIC_SRCU(uprobes_srcu);
>
> @@ -629,8 +630,11 @@ static void put_uprobe(struct uprobe *uprobe)
>
>          write_lock(&uprobes_treelock);
>
> -        if (uprobe_is_active(uprobe))
> +        if (uprobe_is_active(uprobe)) {
> +                write_seqcount_begin(&uprobes_seqcount);
>                  rb_erase(&uprobe->rb_node, &uprobes_tree);
> +                write_seqcount_end(&uprobes_seqcount);
> +        }
>
>          write_unlock(&uprobes_treelock);
>
> @@ -696,14 +700,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
>                  .offset = offset,
>          };
>          struct rb_node *node;
> +        unsigned int seq;
>
>          lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
>
> -        read_lock(&uprobes_treelock);
> -        node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> -        read_unlock(&uprobes_treelock);
> +        do {
> +                seq = read_seqcount_begin(&uprobes_seqcount);
> +                node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
> +                /*
> +                 * Lockless RB-tree lookups can result only in false negatives.
> +                 * If the element is found, it is correct and can be returned
> +                 * under RCU protection. If we find nothing, we need to
> +                 * validate that seqcount didn't change. If it did, we have to
> +                 * try again as we might have missed the element (false
> +                 * negative). If seqcount is unchanged, search truly failed.
> +                 */
> +                if (node)
> +                        return __node_2_uprobe(node);
> +        } while (read_seqcount_retry(&uprobes_seqcount, seq));
>
> -        return node ? __node_2_uprobe(node) : NULL;
> +        return NULL;
>  }
>
>  /*
> @@ -725,7 +741,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
>  {
>          struct rb_node *node;
>  again:
> -        node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> +        node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
>          if (node) {
>                  struct uprobe *u = __node_2_uprobe(node);
>
> @@ -750,7 +766,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>          struct uprobe *u;
>
>          write_lock(&uprobes_treelock);
> +        write_seqcount_begin(&uprobes_seqcount);
>          u = __insert_uprobe(uprobe);
> +        write_seqcount_end(&uprobes_seqcount);
>          write_unlock(&uprobes_treelock);
>
>          return u;
> --
> 2.43.0
>
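(Editorial aside: for readers less familiar with the pattern, here is a minimal
userspace sketch of the seqcount retry idea that find_uprobe_rcu() relies on in
the diff above. It uses C11 atomics and made-up names -- writer_update(),
reader_lookup(), lookup_seq, tree_root -- and deliberately does not use the
kernel seqcount API; the real code uses read_seqcount_begin()/
read_seqcount_retry(), with writers serialized by uprobes_treelock.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_uint lookup_seq;            /* stand-in for uprobes_seqcount */
static _Atomic long tree_root = 42;       /* stand-in for the RB-tree (a single "node" value) */

/* Writer side: sequence is odd while the structure is being modified.
 * In the kernel, writers are additionally serialized by uprobes_treelock. */
static void writer_update(long new_root)
{
        atomic_fetch_add_explicit(&lookup_seq, 1, memory_order_release); /* -> odd */
        atomic_store_explicit(&tree_root, new_root, memory_order_release);
        atomic_fetch_add_explicit(&lookup_seq, 1, memory_order_release); /* -> even */
}

/* Reader side: a hit is returned right away; a miss is only trusted if the
 * sequence did not change while we were looking (otherwise retry). */
static bool reader_lookup(long key, long *out)
{
        for (;;) {
                unsigned int seq = atomic_load_explicit(&lookup_seq, memory_order_acquire);

                if (seq & 1)
                        continue;               /* writer in progress, try again */

                long val = atomic_load_explicit(&tree_root, memory_order_acquire);
                if (val == key) {
                        *out = val;             /* "found" result is trusted as-is */
                        return true;
                }

                if (atomic_load_explicit(&lookup_seq, memory_order_acquire) == seq)
                        return false;           /* validated miss */
        }
}

int main(void)
{
        long v;

        printf("lookup(42) -> %d\n", reader_lookup(42, &v));  /* hit, no retry needed */
        writer_update(7);
        printf("lookup(42) -> %d\n", reader_lookup(42, &v));  /* validated miss */
        return 0;
}

Note that in this scheme a "found" node is trusted and dereferenced before any
retry check, so it depends entirely on whatever pointers a reader can observe
mid-update still pointing at valid (here, SRCU-protected) memory -- which is
exactly the property being questioned for rb_erase() in the report above.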