kprobe_multi_link_prog_run() is called for both multi-kprobe and
multi-kretprobe BPF programs from kprobe_multi_link_handler() and
kprobe_multi_link_exit_handler(), respectively. kprobe_multi_link_prog_run()
does all the relevant work, with those wrappers just satisfying ftrace's
interfaces (the kprobe callback is supposed to return int, while the
kretprobe callback returns void). With this structure, the compiler performs
tail-call optimization:

Dump of assembler code for function kprobe_multi_link_exit_handler:
   0xffffffff8122f1e0 <+0>:     add    $0xffffffffffffffc0,%rdi
   0xffffffff8122f1e4 <+4>:     mov    %rcx,%rdx
   0xffffffff8122f1e7 <+7>:     jmp    0xffffffff81230080 <kprobe_multi_link_prog_run>

This means that when trying to capture an LBR trace covering all indirect
branches, we waste an entry just to record that
kprobe_multi_link_exit_handler jumped into kprobe_multi_link_prog_run. LBR
entries are especially scarce on AMD CPUs (just 16 entries on the latest
CPUs vs typically 32 on the latest Intel CPUs), and every entry counts (we
already spend a bunch of other LBR entries just getting to the BPF
program), so it would be great not to waste any more than necessary.

Marking it as just `static inline` doesn't change anything: the compiler
still performs only the tail-call optimization. But by marking
kprobe_multi_link_prog_run() as __always_inline we ensure that the compiler
fully inlines it, avoiding the jump:

Dump of assembler code for function kprobe_multi_link_exit_handler:
   0xffffffff8122f4e0 <+0>:     push   %r15
   0xffffffff8122f4e2 <+2>:     push   %r14
   0xffffffff8122f4e4 <+4>:     push   %r13
   0xffffffff8122f4e6 <+6>:     push   %r12
   0xffffffff8122f4e8 <+8>:     push   %rbx
   0xffffffff8122f4e9 <+9>:     sub    $0x10,%rsp
   0xffffffff8122f4ed <+13>:    mov    %rdi,%r14
   0xffffffff8122f4f0 <+16>:    lea    -0x40(%rdi),%rax
   ...
   0xffffffff8122f590 <+176>:   call   0xffffffff8108e420 <sched_clock>
   0xffffffff8122f595 <+181>:   sub    %r14,%rax
   0xffffffff8122f598 <+184>:   add    %rax,0x8(%rbx,%r13,1)
   0xffffffff8122f59d <+189>:   jmp    0xffffffff8122f541 <kprobe_multi_link_exit_handler+97>

Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 434e3ece6688..0bebd6f02e17 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2796,7 +2796,7 @@ static u64 bpf_kprobe_multi_entry_ip(struct bpf_run_ctx *ctx)
 	return run_ctx->entry_ip;
 }
 
-static int
+static __always_inline int
 kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
 			   unsigned long entry_ip, struct pt_regs *regs)
 {
-- 
2.43.0
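
Note for readers: below is a minimal, self-contained userspace sketch of
the same code shape, for illustration only. The names and the prog_run()
body are made up (this is not the kernel code), and noinline merely stands
in for a function the compiler would consider too large to auto-inline.
With a typical `gcc -O2` build, exit_handler() compiles to a single jmp
into prog_run() (the tail call); rebuilding with -DFORCE_INLINE applies
always_inline and the jump disappears, mirroring this patch.

/*
 * Hypothetical demo, not kernel code: prog_run() does all the work,
 * handler() and exit_handler() exist only to adapt the return type,
 * just like the ftrace wrappers described above.
 *
 * Build:  gcc -O2 demo.c                -> exit_handler is "jmp prog_run"
 *         gcc -O2 -DFORCE_INLINE demo.c -> prog_run fully inlined, no jmp
 */
#include <stdio.h>

#ifdef FORCE_INLINE
#define INLINE static inline __attribute__((always_inline))
#else
#define INLINE static __attribute__((noinline))
#endif

INLINE int prog_run(unsigned long entry_ip)
{
	/* stand-in for running the actual BPF program */
	return entry_ip & 1;
}

int handler(unsigned long entry_ip)		/* kprobe: must return int */
{
	return prog_run(entry_ip);
}

void exit_handler(unsigned long entry_ip)	/* kretprobe: returns void */
{
	prog_run(entry_ip);			/* tail position -> jmp */
}

int main(void)
{
	printf("%d\n", handler(42));
	exit_handler(43);
	return 0;
}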