On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > Revamp how struct uprobe is refcounted, and thus how its lifetime is > managed. > > Right now, there are a few possible "owners" of uprobe refcount: > - uprobes_tree RB tree assumes one refcount when uprobe is registered > and added to the lookup tree; > - while uprobe is triggered and kernel is handling it in the breakpoint > handler code, temporary refcount bump is done to keep uprobe from > being freed; > - if we have uretprobe requested on a given struct uprobe instance, we > take another refcount to keep uprobe alive until user space code > returns from the function and triggers return handler. > > The uprobe_tree's extra refcount of 1 is confusing and problematic. No > matter how many actual consumers are attached, they all share the same > refcount, and we have an extra logic to drop the "last" (which might not > really be last) refcount once uprobe's consumer list becomes empty. > > This is unconventional and has to be kept in mind as a special case all > the time. Further, because of this design we have the situations where > find_uprobe() will find uprobe, bump refcount, return it to the caller, > but that uprobe will still need uprobe_is_active() check, after which > the caller is required to drop refcount and try again. This is just too > many details leaking to the higher level logic. > > This patch changes refcounting scheme in such a way as to not have > uprobes_tree keeping extra refcount for struct uprobe. Instead, each > uprobe_consumer is assuming its own refcount, which will be dropped > when consumer is unregistered. Other than that, all the active users of > uprobe (entry and return uprobe handling code) keeps exactly the same > refcounting approach. > > With the above setup, once uprobe's refcount drops to zero, we need to > make sure that uprobe's "destructor" removes uprobe from uprobes_tree, > of course. This, though, races with uprobe entry handling code in > handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup, > can race with uprobe being destroyed after refcount drops to zero (e.g., > due to uprobe_consumer unregistering). So we add try_get_uprobe(), which > will attempt to bump refcount, unless it already is zero. Caller needs > to guarantee that uprobe instance won't be freed in parallel, which is > the case while we keep uprobes_treelock (for read or write, doesn't > matter). > > Note also, we now don't leak the race between registration and > unregistration, so we remove the retry logic completely. If > find_uprobe() returns valid uprobe, it's guaranteed to remain in > uprobes_tree with properly incremented refcount. The race is handled > inside __insert_uprobe() and put_uprobe() working together: > __insert_uprobe() will remove uprobe from RB-tree, if it can't bump > refcount and will retry to insert the new uprobe instance. put_uprobe() > won't attempt to remove uprobe from RB-tree, if it's already not there. > All that is protected by uprobes_treelock, which keeps things simple. > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > --- > kernel/events/uprobes.c | 163 +++++++++++++++++++++++----------------- > 1 file changed, 93 insertions(+), 70 deletions(-) > [...] > @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) > int err; > > down_write(&uprobe->register_rwsem); > - if (WARN_ON(!consumer_del(uprobe, uc))) > + if (WARN_ON(!consumer_del(uprobe, uc))) { > err = -ENOENT; > - else > + } else { > err = register_for_each_vma(uprobe, NULL); > - > - /* TODO : cant unregister? schedule a worker thread */ > - if (!err) { > - if (!uprobe->consumers) > - delete_uprobe(uprobe); > - else > - err = -EBUSY; > + /* TODO : cant unregister? schedule a worker thread */ > + WARN(err, "leaking uprobe due to failed unregistration"); Ok, so now that I added this very loud warning if register_for_each_vma(uprobe, NULL) returns error, it turns out it's not that unusual for this unregistration to fail. If I run my uprobe-stress for just a little bit, and then terminate it with ^C, I get this splat: [ 1980.854229] leaking uprobe due to failed unregistration [ 1980.854244] WARNING: CPU: 3 PID: 23013 at kernel/events/uprobes.c:1123 uprobe_unregister_nosync+0x68/0x80 [ 1980.855356] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) floppy(E) i2c_piix4(E) i2c_] [ 1980.856746] CPU: 3 UID: 0 PID: 23013 Comm: exe Tainted: G W OE 6.11.0-rc1-00032-g308d1f294b79 #129 [ 1980.857407] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 1980.857788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04 [ 1980.858489] RIP: 0010:uprobe_unregister_nosync+0x68/0x80 [ 1980.858826] Code: 6e fb ff ff 4c 89 e7 89 c3 e8 24 e8 e3 ff 85 db 75 0c 5b 48 89 ef 5d 41 5c e9 84 e5 ff ff 48 c7 c7 d0 [ 1980.860052] RSP: 0018:ffffc90002fb7e58 EFLAGS: 00010296 [ 1980.860428] RAX: 000000000000002b RBX: 00000000fffffffc RCX: 0000000000000000 [ 1980.860913] RDX: 0000000000000002 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1980.861379] RBP: ffff88811159ac00 R08: 00000000fffeffff R09: 0000000000000001 [ 1980.861871] R10: 0000000000000000 R11: ffffffff83299920 R12: ffff88811159ac20 [ 1980.862340] R13: ffff88810153c7a0 R14: ffff88810c3fe000 R15: 0000000000000000 [ 1980.862830] FS: 0000000000000000(0000) GS:ffff88881ca00000(0000) knlGS:0000000000000000 [ 1980.863370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1980.863758] CR2: 00007fa08aea8276 CR3: 000000010f59c005 CR4: 0000000000370ef0 [ 1980.864239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1980.864708] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1980.865202] Call Trace: [ 1980.865356] <TASK> [ 1980.865524] ? __warn+0x80/0x180 [ 1980.865745] ? uprobe_unregister_nosync+0x68/0x80 [ 1980.866074] ? report_bug+0x18d/0x1c0 [ 1980.866326] ? handle_bug+0x3a/0x70 [ 1980.866568] ? exc_invalid_op+0x13/0x60 [ 1980.866836] ? asm_exc_invalid_op+0x16/0x20 [ 1980.867098] ? uprobe_unregister_nosync+0x68/0x80 [ 1980.867390] ? uprobe_unregister_nosync+0x68/0x80 [ 1980.867726] bpf_uprobe_multi_link_release+0x31/0xd0 [ 1980.868044] bpf_link_free+0x54/0xd0 [ 1980.868267] bpf_link_release+0x17/0x20 [ 1980.868542] __fput+0x102/0x2e0 [ 1980.868760] task_work_run+0x55/0xa0 [ 1980.869027] syscall_exit_to_user_mode+0x1dd/0x1f0 [ 1980.869344] do_syscall_64+0x70/0x140 [ 1980.869603] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 1980.869923] RIP: 0033:0x7fa08aea82a0 [ 1980.870171] Code: Unable to access opcode bytes at 0x7fa08aea8276. [ 1980.870587] RSP: 002b:00007ffe838cd030 EFLAGS: 00000202 ORIG_RAX: 000000000000003b [ 1980.871098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 1980.871563] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1980.872055] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 1980.872526] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1980.873044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1980.873568] </TASK> I traced it a little bit with retsnoop to figure out where this is coming from, and here we go: 14:53:18.897165 -> 14:53:18.897171 TID/PID 23013/23013 (exe/exe): FUNCTION CALLS RESULT DURATION --------------------------- -------- -------- → uprobe_write_opcode → get_user_pages_remote ↔ __get_user_pages [-EINTR] 1.382us ← get_user_pages_remote [-EINTR] 4.889us ← uprobe_write_opcode [-EINTR] 6.908us entry_SYSCALL_64_after_hwframe+0x76 (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130) do_syscall_64+0x70 (arch/x86/entry/common.c:102) syscall_exit_to_user_mode+0x1dd (kernel/entry/common.c:218) . __syscall_exit_to_user_mode_work (kernel/entry/common.c:207) . exit_to_user_mode_prepare (include/linux/entry-common.h:328) . exit_to_user_mode_loop (kernel/entry/common.c:114) . resume_user_mode_work (include/linux/resume_user_mode.h:50) task_work_run+0x55 (kernel/task_work.c:222) __fput+0x102 (fs/file_table.c:422) bpf_link_release+0x17 (kernel/bpf/syscall.c:3116) bpf_link_free+0x54 (kernel/bpf/syscall.c:3067) bpf_uprobe_multi_link_release+0x31 (kernel/trace/bpf_trace.c:3198) . bpf_uprobe_unregister (kernel/trace/bpf_trace.c:3186) uprobe_unregister_nosync+0x42 (kernel/events/uprobes.c:1120) register_for_each_vma+0x427 (kernel/events/uprobes.c:1092) 6us [-EINTR] uprobe_write_opcode+0x79 (kernel/events/uprobes.c:478) . get_user_page_vma_remote (include/linux/mm.h:2489) 4us [-EINTR] get_user_pages_remote+0x109 (mm/gup.c:2627) . __get_user_pages_locked (mm/gup.c:1762) ! 1us [-EINTR] __get_user_pages So, we do uprobe_unregister -> register_for_each_vma -> remove_breakpoint -> set_orig_insn -> uprobe_write_opcode -> get_user_page_vma_remote -> get_user_pages_remote -> __get_user_pages_locked -> __get_user_pages and I think we then hit `if (fatal_signal_pending(current)) return -EINTR;` check. So, is there something smarter we can do in this case besides leaking an uprobe (and note, my changes don't change this behavior)? I can of course just drop the WARN given it's sort of expected now, but if we can handle that more gracefully it would be good. I don't think that should block optimization work, but just something to keep in mind and maybe fix as a follow up. > } > up_write(&uprobe->register_rwsem); > [...]