Dear all, On Thu, May 11, 2023 at 9:06 AM Ze Gao <zegao2021@xxxxxxxxx> wrote: > > I'm afraid filtering in user space tools is not enough, cause it's a kernel BUG. > > it would 100% trigger a kernel crash if you run cmd like > > retsnoop -e 'pick_next_task_fair' -a ':kernel/sched/*.c' -vvv > > which is caused by that BPF_LINK_TYPE_KPROBE_MULTI accidentally > attaches bpf progs > to preempt_count_{add, sub}, which in turn triggers stackoverflow > because the handler itself > calls those functions. I managed to see the big picture of this problem by digging into the code. here is what it goes: rethook_trampoline_handler{ preempt_disable() { preempt_count_add() { <-- trace fprobe_handler() { ftrace_test_recursion_trylock ... ftrace_test_recursion_unlock <- it fails to detect the recursion caused by rethook (rethook_trampoline_handler precisely for this case) routines. } ... rethook_trampoline_handler { [ wash, rinse, repeat, CRASH!!! ] There are some pitfalls here: 1. fprobe exit callback should be guarded by ftrace recursion check as well since user might call any traceable functions just like kprobe_multi_link_prog_run calls migrate_{disable, enable}. In this case, detection in fprobe_handler only is not enough. 2. rethook_trampoline_handler should use preempt_{disable, enable}_notrace instead because it's beyond recursion-free region guarded like 1 3. many rethook helper functions are also used outside the recursion-free regions and therefore they should be marked notrace I've post a new series of patches to resolve cases mentioned above: [Link]: https://lkml.org/lkml/2023/5/14/452 In theory, bpf_kprobe as the user of fprobe+ rethook, is spared from suffering recursion again by applying these. And [Link]: https://lore.kernel.org/all/20230513090548.376522-1-zegao@xxxxxxxxxxx/T/#u becomes optional. That leaves one final question, whether we need probe blacklist or bpf_kprobe blacklist depends on how we deal with user requests if one of its expected hook points fails because of recursion detected. Do we need to reject them in the first place by blacklist or let it fail to execute the callback silently, which needs your gentle advice. Regards, Ze