On Wed, Mar 27, 2024 at 2:59 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > Fix in the sense to adjust or add another generic
> > PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode?
> >
> > > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> > > To figure out the path to return in the code
> > > ARCH_LBR_JCC |
> > > ARCH_LBR_REL_CALL |
> > > ARCH_LBR_IND_CALL |
> > > ARCH_LBR_RETURN
> > >
> > > might be good enough and there won't be a need to do
> > > inling in odd places just to avoid tail jmp.
> >
> > retsnoop supports all modes perf exposes generically (see [0]), I
> > believe I tried all of them and keep gravitating back to --lbr=any as
> > most useful, unfortunately.
>
> I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND
> which I suspect will exclude ARCH_LBR_REL_JMP
> and will avoid counting normal goto-s.

This would be equivalent to passing `--lbr=any_call --lbr=cond` to
retsnoop. And yes, you are right that it doesn't record unconditional
jumps, saving a few more LBR frames. The problem is that it makes it
super hard to follow what's going on without disassembling all
relevant functions down to assembly and following along *very
carefully* to understand the flow of logic. This is normally
completely unnecessary with --lbr=any (unless DWARF line info is
screwed up, of course): --lbr=any lets you follow C statement-level
code flow based on file:line information alone.

So, in summary, yes, `--lbr=any_call --lbr=cond` is a good last-resort
option if a few more LBR records are necessary. But it doesn't feel
like something I'd recommend typical users start with.
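For reference, the two modes discussed above map to roughly the
following perf_event_attr setups (an illustrative sketch, not
retsnoop's actual code; the exact attr details needed to keep LBR
armed for bpf_get_branch_snapshot() may differ):

#include <linux/perf_event.h>
#include <stdbool.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: open a perf event on one CPU that enables LBR recording */
static int open_lbr_event(int cpu, bool any_branch)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	if (any_branch)
		/* --lbr=any: every taken branch, incl. unconditional jumps */
		attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
					  PERF_SAMPLE_BRANCH_ANY;
	else
		/* calls + conditional branches, leaving unconditional
		 * jumps (ARCH_LBR_REL_JMP) out
		 */
		attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
					  PERF_SAMPLE_BRANCH_ANY_CALL |
					  PERF_SAMPLE_BRANCH_COND;

	return syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
		       -1 /* group fd */, PERF_FLAG_FD_CLOEXEC);
}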
> > I'm missing how we can get away from having a per-CPU buffer. LBRs on
> > different CPU cores are completely independent and one BPF prog
> > attachment can be simultaneously running on many CPUs.
> >
> > Or do you mean per-CPU allocation for each attach point?
> >
> > We can do LBR capture before migrate_disable calls and still have
> > correct data most of the time (hopefully), though, so yeah, it can be
> > another improvement (but with inlining of those two BPF helpers I'm
> > not sure we have to do this just yet).
>
> I meant 32*24 buffer per attachment.
> Doing per-attachment per-cpu might not scale?
> It's certainly cleaner with per-cpu though.
> With a single per-attach buffer the assumption was that the different
> cpus will likely take the same path towards that kprobe.

This is a rather bold assumption. It might be true in a lot of cases,
but certainly not always. Think about tracing __sys_bpf for errors.
There can be many simultaneous bpf() syscalls in the system, some
returning an expected -ENOENT during map element iteration (so not
really interesting), but some resulting in confusing failures (and
thus "interesting"). It's even worse with some file-system-related
functions.

My point is that in retsnoop I don't want to rely on these
assumptions. I won't stop anyone from trying to add this functionality
to the kernel, but I don't see retsnoop relying on something like
that.

> retsnoop doesn't care which cpu it collected that stack trace from.
> It cares about the code path and it will be there.

It does care: retsnoop has a bunch of filters to narrow down specific
conditions, where process information is taken into account, among
other things. And these filtering capabilities will only grow over
time.

> We can make the whole thing configurable.
> bpf_link_attach would specify a buffer or per-cpu or
> a buffer right in the bpf map array that should be used
> to store the lbr trace.

I'm pretty happy with the existing functionality and don't really need
new APIs. I'll be posting a few more patches improving performance
(not just LBR usage) over the next few days. With those I get close
enough: only 4 LBR records are wasted with --lbr=any.

Thanks for the brainstorming. The internal per-CPU instruction
implementation turned out pretty well; I'll be sending a patch set
soon.
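FWIW, here is a minimal sketch of what the per-CPU capture shape could
look like (the map, sizes, and attach point below are made up for
illustration; this is not the actual patch set). The idea is just to
snapshot LBRs into a per-CPU scratch buffer as early as possible in
the program, so fewer records are wasted on kprobe/BPF plumbing:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_LBR_ENTRIES 32

struct lbr_buf {
	long sz; /* bytes written by bpf_get_branch_snapshot() */
	struct perf_branch_entry entries[MAX_LBR_ENTRIES];
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct lbr_buf);
} lbr_scratch SEC(".maps");

SEC("kprobe/__sys_bpf") /* hypothetical attach point */
int capture_lbr(struct pt_regs *ctx)
{
	u32 zero = 0;
	struct lbr_buf *buf;

	buf = bpf_map_lookup_elem(&lbr_scratch, &zero);
	if (!buf)
		return 0;

	/* capture LBRs first thing, before any other helper calls */
	buf->sz = bpf_get_branch_snapshot(buf->entries,
					  sizeof(buf->entries), 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";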