On Wed, Mar 27, 2024 at 2:59 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > Fix in the sense to adjust or add another generic
> > PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode?
> >
> > > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> > > To figure out the path to return in the code
> > > ARCH_LBR_JCC |
> > > ARCH_LBR_REL_CALL |
> > > ARCH_LBR_IND_CALL |
> > > ARCH_LBR_RETURN
> > >
> > > might be good enough and there won't be a need to do
> > > inling in odd places just to avoid tail jmp.
> >
> > retsnoop supports all modes perf exposes generically (see [0]), I
> > believe I tried all of them and keep gravitating back to --lbr=any as
> > most useful, unfortunately.
>
> I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND
> which I suspect will exclude ARCH_LBR_REL_JMP
> and will avoid counting normal goto-s.

This would be equivalent to passing `--lbr=any_call --lbr=cond` to
retsnoop. And yes, you are right that it doesn't record unconditional
jumps, saving a few more LBR frames. The problem is that it makes it
super hard to follow what's going on without disassembling all
relevant functions down to assembly and following along *very
carefully* to understand the flow of logic. This is normally
completely unnecessary with --lbr=any (unless DWARF line info is
screwed up, of course): --lbr=any lets you follow C statement-level
code flow based on file:line information alone.

So, in summary, yes, `--lbr=any_call --lbr=cond` is a good last-resort
option if a few more LBR records are necessary. But it doesn't feel
like something I'd recommend typical users start with.
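For reference, the two modes discussed above map to roughly the
following perf_event_attr setups (an illustrative sketch, not
retsnoop's actual code; the exact attr details needed to keep LBR
armed for bpf_get_branch_snapshot() may differ):

#include <linux/perf_event.h>
#include <stdbool.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: open a perf event on one CPU that enables LBR recording */
static int open_lbr_event(int cpu, bool any_branch)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	if (any_branch)
		/* --lbr=any: every taken branch, incl. unconditional jumps */
		attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
					  PERF_SAMPLE_BRANCH_ANY;
	else
		/* calls + conditional branches, leaving unconditional
		 * jumps (ARCH_LBR_REL_JMP) out
		 */
		attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
					  PERF_SAMPLE_BRANCH_ANY_CALL |
					  PERF_SAMPLE_BRANCH_COND;

	return syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
		       -1 /* group fd */, PERF_FLAG_FD_CLOEXEC);
}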
> > I'm missing how we can get away from having a per-CPU buffer. LBRs on
> > different CPU cores are completely independent and one BPF prog
> > attachment can be simultaneously running on many CPUs.
> >
> > Or do you mean per-CPU allocation for each attach point?
> >
> > We can do LBR capture before migrate_disable calls and still have
> > correct data most of the time (hopefully), though, so yeah, it can be
> > another improvement (but with inlining of those two BPF helpers I'm
> > not sure we have to do this just yet).
>
> I meant 32*24 buffer per attachment.
> Doing per-attachment per-cpu might not scale?
> It's certainly cleaner with per-cpu though.
> With a single per-attach buffer the assumption was that the different
> cpus will likely take the same path towards that kprobe.

This is a rather bold assumption. It might be true in a lot of cases,
but certainly not always. Think about tracing __sys_bpf for errors.
There can be many simultaneous bpf() syscalls in the system, some
returning an expected -ENOENT during map element iteration (so not
really interesting), but some resulting in confusing failures (and
thus "interesting"). It's even worse with some file-system-related
functions.

My point is that in retsnoop I don't want to rely on these
assumptions. I won't stop anyone from trying to add this functionality
to the kernel, but I don't see retsnoop relying on something like
that.

> retsnoop doesn't care which cpu it collected that stack trace from.
> It cares about the code path and it will be there.

It does care: retsnoop has a bunch of filters to narrow down specific
conditions, where process information is taken into account, among
other things. And these filtering capabilities will only grow over
time.

> We can make the whole thing configurable.
> bpf_link_attach would specify a buffer or per-cpu or
> a buffer right in the bpf map array that should be used
> to store the lbr trace.

I'm pretty happy with the existing functionality and don't really need
new APIs. I'll be posting a few more patches improving performance
(not just LBR usage) over the next few days. With those I get close
enough: only 4 LBR records are wasted with --lbr=any.

Thanks for the brainstorming. The internal per-CPU instruction
implementation turned out pretty well; I'll be sending a patch set
soon.
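FWIW, here is a minimal sketch of what the per-CPU capture shape could
look like (the map, sizes, and attach point below are made up for
illustration; this is not the actual patch set). The idea is just to
snapshot LBRs into a per-CPU scratch buffer as early as possible in
the program, so fewer records are wasted on kprobe/BPF plumbing:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_LBR_ENTRIES 32

struct lbr_buf {
	long sz; /* bytes written by bpf_get_branch_snapshot() */
	struct perf_branch_entry entries[MAX_LBR_ENTRIES];
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct lbr_buf);
} lbr_scratch SEC(".maps");

SEC("kprobe/__sys_bpf") /* hypothetical attach point */
int capture_lbr(struct pt_regs *ctx)
{
	u32 zero = 0;
	struct lbr_buf *buf;

	buf = bpf_map_lookup_elem(&lbr_scratch, &zero);
	if (!buf)
		return 0;

	/* capture LBRs first thing, before any other helper calls */
	buf->sz = bpf_get_branch_snapshot(buf->entries,
					  sizeof(buf->entries), 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";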