Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 1 Aug 2024 15:07:15 -0700

On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> Revamp how struct uprobe is refcounted, and thus how its lifetime is
> managed.
>
> Right now, there are a few possible "owners" of uprobe refcount:
>   - uprobes_tree RB tree assumes one refcount when uprobe is registered
>     and added to the lookup tree;
>   - while uprobe is triggered and kernel is handling it in the breakpoint
>     handler code, temporary refcount bump is done to keep uprobe from
>     being freed;
>   - if we have uretprobe requested on a given struct uprobe instance, we
>     take another refcount to keep uprobe alive until user space code
>     returns from the function and triggers return handler.
>
> The uprobe_tree's extra refcount of 1 is confusing and problematic. No
> matter how many actual consumers are attached, they all share the same
> refcount, and we have an extra logic to drop the "last" (which might not
> really be last) refcount once uprobe's consumer list becomes empty.
>
> This is unconventional and has to be kept in mind as a special case all
> the time. Further, because of this design we have the situations where
> find_uprobe() will find uprobe, bump refcount, return it to the caller,
> but that uprobe will still need uprobe_is_active() check, after which
> the caller is required to drop refcount and try again. This is just too
> many details leaking to the higher level logic.
>
> This patch changes refcounting scheme in such a way as to not have
> uprobes_tree keeping extra refcount for struct uprobe. Instead, each
> uprobe_consumer is assuming its own refcount, which will be dropped
> when consumer is unregistered. Other than that, all the active users of
> uprobe (entry and return uprobe handling code) keeps exactly the same
> refcounting approach.
>
> With the above setup, once uprobe's refcount drops to zero, we need to
> make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
> of course. This, though, races with uprobe entry handling code in
> handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
> can race with uprobe being destroyed after refcount drops to zero (e.g.,
> due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
> will attempt to bump refcount, unless it already is zero. Caller needs
> to guarantee that uprobe instance won't be freed in parallel, which is
> the case while we keep uprobes_treelock (for read or write, doesn't
> matter).
>
> Note also, we now don't leak the race between registration and
> unregistration, so we remove the retry logic completely. If
> find_uprobe() returns valid uprobe, it's guaranteed to remain in
> uprobes_tree with properly incremented refcount. The race is handled
> inside __insert_uprobe() and put_uprobe() working together:
> __insert_uprobe() will remove uprobe from RB-tree, if it can't bump
> refcount and will retry to insert the new uprobe instance. put_uprobe()
> won't attempt to remove uprobe from RB-tree, if it's already not there.
> All that is protected by uprobes_treelock, which keeps things simple.
>
> Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> ---
>  kernel/events/uprobes.c | 163 +++++++++++++++++++++++-----------------
>  1 file changed, 93 insertions(+), 70 deletions(-)
>

[...]

> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>         int err;
>
>         down_write(&uprobe->register_rwsem);
> -       if (WARN_ON(!consumer_del(uprobe, uc)))
> +       if (WARN_ON(!consumer_del(uprobe, uc))) {
>                 err = -ENOENT;
> -       else
> +       } else {
>                 err = register_for_each_vma(uprobe, NULL);
> -
> -       /* TODO : cant unregister? schedule a worker thread */
> -       if (!err) {
> -               if (!uprobe->consumers)
> -                       delete_uprobe(uprobe);
> -               else
> -                       err = -EBUSY;
> +               /* TODO : cant unregister? schedule a worker thread */
> +               WARN(err, "leaking uprobe due to failed unregistration");

Ok, so now that I added this very loud warning if
register_for_each_vma(uprobe, NULL) returns error, it turns out it's
not that unusual for this unregistration to fail. If I run my
uprobe-stress for just a little bit, and then terminate it with ^C, I
get this splat:

[ 1980.854229] leaking uprobe due to failed unregistration
[ 1980.854244] WARNING: CPU: 3 PID: 23013 at
kernel/events/uprobes.c:1123 uprobe_unregister_nosync+0x68/0x80
[ 1980.855356] Modules linked in: aesni_intel(E) crypto_simd(E)
cryptd(E) kvm_intel(E) kvm(E) floppy(E) i2c_piix4(E) i2c_]
[ 1980.856746] CPU: 3 UID: 0 PID: 23013 Comm: exe Tainted: G        W
OE      6.11.0-rc1-00032-g308d1f294b79 #129
[ 1980.857407] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1980.857788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04
[ 1980.858489] RIP: 0010:uprobe_unregister_nosync+0x68/0x80
[ 1980.858826] Code: 6e fb ff ff 4c 89 e7 89 c3 e8 24 e8 e3 ff 85 db
75 0c 5b 48 89 ef 5d 41 5c e9 84 e5 ff ff 48 c7 c7 d0
[ 1980.860052] RSP: 0018:ffffc90002fb7e58 EFLAGS: 00010296
[ 1980.860428] RAX: 000000000000002b RBX: 00000000fffffffc RCX: 0000000000000000
[ 1980.860913] RDX: 0000000000000002 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1980.861379] RBP: ffff88811159ac00 R08: 00000000fffeffff R09: 0000000000000001
[ 1980.861871] R10: 0000000000000000 R11: ffffffff83299920 R12: ffff88811159ac20
[ 1980.862340] R13: ffff88810153c7a0 R14: ffff88810c3fe000 R15: 0000000000000000
[ 1980.862830] FS:  0000000000000000(0000) GS:ffff88881ca00000(0000)
knlGS:0000000000000000
[ 1980.863370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1980.863758] CR2: 00007fa08aea8276 CR3: 000000010f59c005 CR4: 0000000000370ef0
[ 1980.864239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1980.864708] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1980.865202] Call Trace:
[ 1980.865356]  <TASK>
[ 1980.865524]  ? __warn+0x80/0x180
[ 1980.865745]  ? uprobe_unregister_nosync+0x68/0x80
[ 1980.866074]  ? report_bug+0x18d/0x1c0
[ 1980.866326]  ? handle_bug+0x3a/0x70
[ 1980.866568]  ? exc_invalid_op+0x13/0x60
[ 1980.866836]  ? asm_exc_invalid_op+0x16/0x20
[ 1980.867098]  ? uprobe_unregister_nosync+0x68/0x80
[ 1980.867390]  ? uprobe_unregister_nosync+0x68/0x80
[ 1980.867726]  bpf_uprobe_multi_link_release+0x31/0xd0
[ 1980.868044]  bpf_link_free+0x54/0xd0
[ 1980.868267]  bpf_link_release+0x17/0x20
[ 1980.868542]  __fput+0x102/0x2e0
[ 1980.868760]  task_work_run+0x55/0xa0
[ 1980.869027]  syscall_exit_to_user_mode+0x1dd/0x1f0
[ 1980.869344]  do_syscall_64+0x70/0x140
[ 1980.869603]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1980.869923] RIP: 0033:0x7fa08aea82a0
[ 1980.870171] Code: Unable to access opcode bytes at 0x7fa08aea8276.
[ 1980.870587] RSP: 002b:00007ffe838cd030 EFLAGS: 00000202 ORIG_RAX:
000000000000003b
[ 1980.871098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1980.871563] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1980.872055] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1980.872526] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1980.873044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1980.873568]  </TASK>

I traced it a little bit with retsnoop to figure out where this is
coming from, and here we go:

14:53:18.897165 -> 14:53:18.897171 TID/PID 23013/23013 (exe/exe):

FUNCTION CALLS                RESULT    DURATION
---------------------------   --------  --------
→ uprobe_write_opcode
    → get_user_pages_remote
        ↔ __get_user_pages    [-EINTR]   1.382us
    ← get_user_pages_remote   [-EINTR]   4.889us
← uprobe_write_opcode         [-EINTR]   6.908us

                   entry_SYSCALL_64_after_hwframe+0x76
(entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130)
                   do_syscall_64+0x70
(arch/x86/entry/common.c:102)
                   syscall_exit_to_user_mode+0x1dd
(kernel/entry/common.c:218)
                   . __syscall_exit_to_user_mode_work
(kernel/entry/common.c:207)
                   . exit_to_user_mode_prepare
(include/linux/entry-common.h:328)
                   . exit_to_user_mode_loop
(kernel/entry/common.c:114)
                   . resume_user_mode_work
(include/linux/resume_user_mode.h:50)
                   task_work_run+0x55                   (kernel/task_work.c:222)
                   __fput+0x102                         (fs/file_table.c:422)
                   bpf_link_release+0x17
(kernel/bpf/syscall.c:3116)
                   bpf_link_free+0x54
(kernel/bpf/syscall.c:3067)
                   bpf_uprobe_multi_link_release+0x31
(kernel/trace/bpf_trace.c:3198)
                   . bpf_uprobe_unregister
(kernel/trace/bpf_trace.c:3186)
                   uprobe_unregister_nosync+0x42
(kernel/events/uprobes.c:1120)
                   register_for_each_vma+0x427
(kernel/events/uprobes.c:1092)
     6us [-EINTR]  uprobe_write_opcode+0x79
(kernel/events/uprobes.c:478)
                   . get_user_page_vma_remote
(include/linux/mm.h:2489)
     4us [-EINTR]  get_user_pages_remote+0x109          (mm/gup.c:2627)
                   . __get_user_pages_locked            (mm/gup.c:1762)
!    1us [-EINTR]  __get_user_pages

So, we do uprobe_unregister -> register_for_each_vma ->
remove_breakpoint -> set_orig_insn -> uprobe_write_opcode ->
get_user_page_vma_remote -> get_user_pages_remote ->
__get_user_pages_locked -> __get_user_pages and I think we then hit
`if (fatal_signal_pending(current)) return -EINTR;` check.

So, is there something smarter we can do in this case besides leaking
an uprobe (and note, my changes don't change this behavior)?

I can of course just drop the WARN given it's sort of expected now,
but if we can handle that more gracefully it would be good. I don't
think that should block optimization work, but just something to keep
in mind and maybe fix as a follow up.

>         }
>         up_write(&uprobe->register_rwsem);
>

[...]