On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> >
> > There are still ways to reduce number of "wasted" records further, this is
> > a problem that requires many small and rather independent steps.
>
> I feel this is a wrong path to follow.
> I think it would be better to introduce a flag for kprobe/fentry
> to do perf_snapshot_branch_stack() as early as possible
> and then bpf prog can copy these 16 or 32 8-byte entries at
> its leasure.

This is basically how Song started when he was adding this feature a
few years ago. And I feel like we discussed this back then and decided
that it would be cleaner to let the BPF program decide when (and
whether) to capture LBR, based on its own conditions. It still feels
like the right tradeoff.

Granted, for PERF_SAMPLE_BRANCH_ANY you have to take it immediately
(and that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN
this doesn't have to happen so fast. The BPF program can evaluate some
conditions and grab LBR only when it's actually needed, saving the
overhead (see the sketch right after the fentry comparison below).

With a prog flag saying "kernel should capture LBR ASAP", we:

a) lose this flexibility to decide whether and when to grab LBR;
b) pay the overhead on every prog invocation, regardless of whether
   the LBR data is ever actually used;
c) have to dedicate pretty large (32 * 24 = 768 bytes) per-CPU buffers
   for something that is pretty niche (though hugely valuable when
   needed, of course);
d) each program type that supports the bpf_get_branch_snapshot()
   helper needs to implement this logic in its corresponding
   `bpf_prog_run_xxx()` running helper, which is more than a few
   places.

Now, let's see how much we can realistically save with this approach.

For fentry, we do save a few (2) entries, indeed. With the changes in
this patch set we are at:

[#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18
[#06] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508829+0x7f
[#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0
[#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0
[#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe
[#02] __bpf_prog_enter_recur+0x43 -> bpf_trampoline_6442508829+0xa1
[#01] bpf_trampoline_6442508829+0xad -> bpf_prog_dc54a596b39d4177_fexit1+0x0
[#00] bpf_prog_dc54a596b39d4177_fexit1+0x101 -> intel_pmu_snapshot_branch_stack+0x0

With the flag and kernel support, we would be at something like:

[#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18
[#06] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508829+0x7f
[#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0
[#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0
[#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe
[#02] __bpf_prog_enter_recur+0x43 -> intel_pmu_snapshot_branch_stack+0x0

So we get 2 extra LBR records at the expense of all those downsides I
mentioned above.
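
To make the "grab LBR only when needed" point above a bit more
concrete, here is a minimal sketch of what I have in mind (the attach
point and names are made up for illustration, this is not retsnoop's
actual code): an fexit program that pays for the snapshot only when
the traced function actually failed.

/* Minimal sketch: capture LBR conditionally, only on error return.
 * Attach point and variable names are hypothetical.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_LBR_ENTRIES 32

struct perf_branch_entry lbrs[MAX_LBR_ENTRIES]; /* filled on demand */
long lbrs_sz; /* bytes captured, or -errno */

SEC("fexit/vfs_read") /* hypothetical target function */
int BPF_PROG(handle_vfs_read_exit, struct file *file, char *buf,
	     size_t count, loff_t *pos, long ret)
{
	if (ret >= 0)
		return 0; /* common case: no error, no LBR overhead at all */

	/* only now snapshot and copy the branch records; with
	 * PERF_SAMPLE_BRANCH_ANY_RETURN the straight-line check above
	 * doesn't add extra return records
	 */
	lbrs_sz = bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

In the common (non-error) case this costs just a load and a branch,
whereas an unconditional-capture flag would pay the full snapshot cost
on every invocation.
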
For kretprobe-multi the savings are even smaller (just 1 record). With
the changes in this patch set, we are at:

[#10] __sys_bpf+0xdfc -> arch_rethook_trampoline+0x0
[#09] arch_rethook_trampoline+0x27 -> arch_rethook_trampoline_callback+0x0
[#08] arch_rethook_trampoline_callback+0x31 -> rethook_trampoline_handler+0x0
[#07] rethook_trampoline_handler+0x6f -> fprobe_exit_handler+0x0
[#06] fprobe_exit_handler+0x3d -> rcu_is_watching+0x0
[#05] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42
[#04] fprobe_exit_handler+0xb4 -> kprobe_multi_link_exit_handler+0x0
[#03] kprobe_multi_link_exit_handler+0x31 -> migrate_disable+0x0
[#02] migrate_disable+0x37 -> kprobe_multi_link_exit_handler+0x36
[#01] kprobe_multi_link_exit_handler+0x5c -> bpf_prog_2b455b4f8a8d48c5_kexit+0x0
[#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3 -> intel_pmu_snapshot_branch_stack+0x0

With custom flag support:

[#10] __sys_bpf+0xdfc -> arch_rethook_trampoline+0x0
[#09] arch_rethook_trampoline+0x27 -> arch_rethook_trampoline_callback+0x0
[#08] arch_rethook_trampoline_callback+0x31 -> rethook_trampoline_handler+0x0
[#07] rethook_trampoline_handler+0x6f -> fprobe_exit_handler+0x0
[#06] fprobe_exit_handler+0x3d -> rcu_is_watching+0x0
[#05] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42
[#04] fprobe_exit_handler+0xb4 -> kprobe_multi_link_exit_handler+0x0
[#03] kprobe_multi_link_exit_handler+0x31 -> migrate_disable+0x0
[#02] migrate_disable+0x37 -> kprobe_multi_link_exit_handler+0x36
[#01] kprobe_multi_link_exit_handler+0x5c -> intel_pmu_snapshot_branch_stack+0x0

We save just 1 extra LBR record. For PERF_SAMPLE_BRANCH_ANY_RETURN
mode there will be no savings at all. Is it really worth it? Any other
improvements (like somehow flattening the rethook call path) will
benefit both approaches equally.

> Hacking all over the kernel and requiring bpf prog to call
> bpf_get_branch_snapshot() in the first few instructions
> looks like self inflicted pain.

While inlining bpf_get_branch_snapshot() benefits only the LBR use
case, it's a rather typical BPF helper inlining procedure that we do
for a lot of helpers, so it's not exactly a hack or anything, just an
optimization.

But inlining bpf_get_smp_processor_id() goes way beyond LBR. It's a
pretty frequently used helper that implements various
BPF-program-specific per-CPU patterns: recursion protection, temporary
storage, or just plain replacing BPF_MAP_TYPE_PERCPU_ARRAY with global
variables (which is already a bit faster approach, and now will be
even faster). And the implementation is well-contained.
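
To illustrate the kind of per-CPU usage I mean (a made-up minimal
example, not code from any particular program): a plain global array
indexed by CPU id serving as a recursion guard plus scratch buffer,
where every bpf_get_smp_processor_id() call sits on the hot path and
benefits directly from the inlining.

/* Made-up minimal example: per-CPU recursion guard + temporary storage
 * implemented with a plain global array instead of a
 * BPF_MAP_TYPE_PERCPU_ARRAY lookup. Names and sizes are illustrative.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_CPUS 128 /* power of two, assumed >= number of possible CPUs */

struct scratch {
	int busy;     /* recursion protection for this CPU */
	char buf[64]; /* temporary per-CPU storage */
};

struct scratch per_cpu_scratch[MAX_CPUS];

SEC("kprobe/do_sys_openat2") /* hypothetical attach point */
int BPF_KPROBE(handle_openat2)
{
	/* masking keeps the index provably in bounds for the verifier */
	u32 cpu = bpf_get_smp_processor_id() & (MAX_CPUS - 1);
	struct scratch *s = &per_cpu_scratch[cpu];

	if (s->busy) /* nested invocation on this CPU, bail out */
		return 0;
	s->busy = 1;

	/* ... use s->buf as cheap per-CPU temporary storage ... */

	s->busy = 0;
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
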