On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > >
> > > > > There are still ways to reduce the number of "wasted" records further; this is
> > > > > a problem that requires many small and rather independent steps.
> > > >
> > > > I feel this is the wrong path to follow.
> > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > to do perf_snapshot_branch_stack() as early as possible,
> > > > and then the bpf prog can copy these 16 or 32 8-byte entries at
> > > > its leisure.
> > >
> > > This is basically how Song started when he was adding this feature a
> > > few years ago. And I feel like we discussed this and decided that it
> > > would be cleaner to let the BPF program decide when (and whether) to
> > > get LBR, based on conditions. It still feels like the right tradeoff.
> >
> > Right, we discussed it back then, and at that time it was about
> > collecting stacks.
> > What's different now is that you want to collect all types of branches.

I was using --lbr=any and that's probably something to fix. I suspect
retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled. To figure
out the code path that leads to a return,
ARCH_LBR_JCC | ARCH_LBR_REL_CALL | ARCH_LBR_IND_CALL | ARCH_LBR_RETURN
might be good enough, and then there won't be a need to do inlining in
odd places just to avoid a tail jmp (see the filter sketch at the end of
this email).

> do this to get almost all the benefit. I just need to inline
> bpf_get_branch_snapshot().

If that is the only one that needs inlining, then fine, but I really
don't like to always_inline kprobe_multi_link_prog_run(). A day will go
by and somebody will send a patch to save 500 bytes of kernel .text by
removing the always_inline. The argument that it's there to help a user
space tool that wants to do lbr=all instead of excluding rel_jmp won't
look good.

> I don't think we can get rid of migrate_disable, we need to make sure
> we are freezing LBR on the CPU on which BPF program will run. So it's
> either preempt_disable or migrate_disable.

We cannot extend preempt_disable across the prog, and migrate_disable
won't really help, since another prog on the same cpu could do the same
"save lbr" action in a different hook and trash the per-cpu scratch
space. But we don't need either migrate_disable or preempt_disable. We
can have a 32*24 byte buffer per attach point. In the case of fentry it
can live in bpf_trampoline or in bpf_link (I didn't analyze the pros and
cons too far), and fentry will only do a single
"call intel_pmu_snapshot_branch_stack" with that address. That's a
trivial addition to arch_prepare_bpf_trampoline(). Then the bpf prog
will take the entries from the link, since it has access to it (see the
rough sketch at the end of this email). Same thing for kprobes: as soon
as one is triggered, it will call intel_pmu_snapshot_branch_stack.
Should be simple to add. Recursion can overwrite that per-attach buffer,
but lbr is screwed anyway if we recursed, so not a concern.

> Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
> implemented, we call into original function and it returns into
> trampoline. So this seems unavoidable as well without completely
> changing how trampoline works for fexit.
> Multi-kprobe actually, conveniently, avoids this problem.

Definitely do not want to redesign that to help retsnoop save an lbr
entry.

> > Inlining bpf_get_branch_snapshot() may be ok,
> > but the way you're doing it is not a clear win.
> > Replacing size / 24 (which the compiler optimizes into a mul)
> > with an actual divide and adding an extra mul will be slower.
> > div+mul vs mul is quite a difference.
> > How noticeable that is in practice is questionable,
> > but from an inlining perspective it doesn't feel right to do
> > "inline to avoid an extra frame" instead of
> > "inline to improve performance".
>
> Yes, I saw this division-through-multiplication trick. I can
> replicate that as well, but it will be pretty hard to understand, so I
> thought it is probably not worth it. Note that
> bpf_get_branch_snapshot() is not some sort of performance-critical
> helper; if you are calling it on some super frequent kprobe/fentry,
> you are paying a big price just for copying 700+ bytes (and
> probably a bunch of other stuff).

div is the slowest instruction. On Skylake it takes 57 uops and 40-90
cycles, while mul takes 3 cycles and an L1 cache hit is 1-2. So it might
be faster to copy 768 bytes than to do a single div (see the example at
the end of this email).

I still think that adding a call to intel_pmu_snapshot_branch_stack to
fentry and kprobes is a simpler and cleaner solution that eliminates all
the guesswork around compiler inlining and optimizations.

We can potentially optimize it further. Since
arch_prepare_bpf_trampoline() is arch specific, for x86 we can inline:

	local_irq_save(flags);
	__intel_pmu_disable_all(false);
	__intel_pmu_lbr_disable();

into the generated trampoline (the above is just 5-6 instructions) and
then call into __intel_pmu_snapshot_branch_stack.
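
To make the suggestions above a bit more concrete, a few rough, untested
sketches follow; they are illustrations, not proposed patches.

First, the branch filter point. A tool like retsnoop could ask for a
narrower filter through the usual perf_event_open() path instead of
PERF_SAMPLE_BRANCH_ANY (open_lbr_event() below is a made-up helper name
and the event setup is stripped down; the PERF_SAMPLE_BRANCH_* bits are
roughly the uapi counterparts of the ARCH_LBR_* bits above):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record conditional branches, calls and returns, but not unconditional
 * jumps. Real setup (sampling params, one event per CPU, error handling)
 * is left out.
 */
static int open_lbr_event(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
				  PERF_SAMPLE_BRANCH_COND |
				  PERF_SAMPLE_BRANCH_ANY_CALL |
				  PERF_SAMPLE_BRANCH_IND_CALL |
				  PERF_SAMPLE_BRANCH_ANY_RETURN;

	return syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
		       -1 /* group_fd */, 0 /* flags */);
}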
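
Second, the per-attach-point buffer, in plain C rather than generated
trampoline code and with made-up names (struct bpf_link_lbr_buf,
snapshot_lbr_into(), LBR_SNAPSHOT_ENTRIES); the only real pieces are
struct perf_branch_entry and the perf_snapshot_branch_stack static call
that bpf_get_branch_snapshot() already uses:

#include <linux/perf_event.h>
#include <linux/static_call.h>
#include <linux/types.h>

/* Hypothetical per-attach-point buffer: 32 entries * 24 bytes = 768 bytes,
 * embedded in bpf_link or bpf_trampoline.
 */
#define LBR_SNAPSHOT_ENTRIES	32

struct bpf_link_lbr_buf {
	u32 cnt;
	struct perf_branch_entry entries[LBR_SNAPSHOT_ENTRIES];
};

/* First thing the fentry trampoline (or the kprobe entry path) would do,
 * before any other kernel code pollutes the LBR.
 */
static __always_inline void snapshot_lbr_into(struct bpf_link_lbr_buf *buf)
{
	buf->cnt = static_call(perf_snapshot_branch_stack)(buf->entries,
							   LBR_SNAPSHOT_ENTRIES);
}

The bpf prog would then read the entries out of the link (or
bpf_get_branch_snapshot() could copy from this buffer) instead of
freezing the LBR at helper-call time.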
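
And on the size / 24 point, a tiny stand-alone illustration (nothing
kernel specific) of why the constant divisor matters:

/* Divisor is a compile-time constant (24 == sizeof(struct perf_branch_entry)),
 * so gcc/clang strength-reduce the division into a multiply plus shift;
 * no div instruction is emitted.
 */
unsigned int cnt_const(unsigned int size)
{
	return size / 24;
}

/* Divisor is only known at run time, so this one executes an actual div,
 * which is the 40-90 cycle case above.
 */
unsigned int cnt_runtime(unsigned int size, unsigned int entry_size)
{
	return size / entry_size;
}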