Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers

On Mon, Mar 25, 2024 at 8:13 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > > >
> > > > > >
> > > > > > There are still ways to reduce the number of "wasted" records further; this
> > > > > > is a problem that requires many small and rather independent steps.
> > > > >
> > > > > I feel this is a wrong path to follow.
> > > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > > to do perf_snapshot_branch_stack() as early as possible
> > > > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > > > > its leisure.
> > > >
> > > > This is basically how Song started when he was adding this feature a
> > > > few years ago. And I feel like we discussed this and decided that it
> > > > would be cleaner to let the BPF program decide when (and whether) to
> > > > get LBR, based on conditions. It still feels like the right tradeoff.
> > >
> > > Right, we discussed it back then, and at that time it was about
> > > collecting stacks.
> > > What's different now is that you want to collect all types of branches
> >
> > I was using --lbr=any
>
> and that's probably something to fix.

Fix in the sense of adjusting or adding another generic
PERF_SAMPLE_BRANCH_xxx value? Or do you mean stop using --lbr=any mode?

> I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> To figure out the path to return in the code
> ARCH_LBR_JCC      |
> ARCH_LBR_REL_CALL |
> ARCH_LBR_IND_CALL |
> ARCH_LBR_RETURN
>
> might be good enough and there won't be a need to do
> inlining in odd places just to avoid a tail jmp.

retsnoop supports all the branch modes perf exposes generically (see
[0]). I believe I've tried all of them, and I keep gravitating back to
--lbr=any as the most useful one, unfortunately.

But it's ok, let's put this particular __always_inline on pause for
now; it's one more LBR record, not the end of the world.

  [0] https://github.com/anakryiko/retsnoop/blob/master/src/retsnoop.c#L269-L280
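
For reference, the generic equivalent of the ARCH_LBR_* set above would
look roughly like this on the perf_event_open() side (just a sketch:
the event type/config are placeholders, and the mapping to the LBR
filter bits is my rough reading of them):

  #include <linux/perf_event.h>

  struct perf_event_attr attr = {
          .type = PERF_TYPE_HARDWARE,
          .config = PERF_COUNT_HW_CPU_CYCLES,     /* placeholder event */
          .sample_type = PERF_SAMPLE_BRANCH_STACK,
          .branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
                  PERF_SAMPLE_BRANCH_COND |       /* ~ARCH_LBR_JCC */
                  PERF_SAMPLE_BRANCH_CALL |       /* ~ARCH_LBR_REL_CALL */
                  PERF_SAMPLE_BRANCH_IND_CALL |   /* ~ARCH_LBR_IND_CALL */
                  PERF_SAMPLE_BRANCH_ANY_RETURN,  /* ~ARCH_LBR_RETURN */
  };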

>
> > do this to get almost all the benefit. I just need to inline
> > bpf_get_branch_snapshot().
>
> If that is the only one that needs inlining then fine,

yes, it's one of the most important ones, I'll take it :)
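
For context, this is roughly how the helper is used from the BPF
program side today (a minimal sketch; the attach point and MAX_ENTRIES
are illustrative, struct perf_branch_entry comes from vmlinux.h):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define MAX_ENTRIES 32  /* matches a 32-entry arch LBR */

  struct perf_branch_entry lbrs[MAX_ENTRIES];

  SEC("kprobe/placeholder_func")          /* attach point is made up */
  int snapshot_lbrs(void *ctx)
  {
          /* returns bytes written (24 per entry) or a negative error */
          long sz = bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);

          if (sz > 0)
                  bpf_printk("captured %ld LBR entries",
                             sz / sizeof(struct perf_branch_entry));
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";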

> but I really don't like to always_inline
> kprobe_multi_link_prog_run().
> A day will go by and somebody will send a patch
> to save 500 bytes of kernel .text by removing always_inline.
> The argument that it's there to help a user space tool that
> wants to do lbr=all instead of excluding rel_jmp
> won't look good.
>

There is a lot of __always_inline in the rethook/ftrace code, as well
as some in BPF code. I don't remember people trying to roll that back,
so this seems a bit overdramatic. But ok, if you think it will be
problematic to reject such hypothetical patches, let's put
kprobe_multi_link_prog_run() inlining aside for now.

> >
> > I don't think we can get rid of migrate_disable; we need to make sure
> > we are freezing LBR on the CPU on which the BPF program will run. So it's
> > either preempt_disable or migrate_disable.
>
> we cannot extend preempt disable across the prog
> and migrate_disable won't really help, since
> there could be another prog on the same cpu
> doing the same "save lbr" action in a different hook
> that will trash per-cpu scratch space.
>
> But we don't need either migrate_disable or preempt_disable.
> We can have a 32*24 byte buffer per attach point.

I'm missing how we can get away from having a per-CPU buffer. LBRs on
different CPU cores are completely independent and one BPF prog
attachment can be simultaneously running on many CPUs.

Or do you mean a per-CPU allocation for each attach point?

We could do the LBR capture before the migrate_disable call and still
get correct data most of the time (hopefully), so yes, that can be
another improvement (though with those two BPF helpers inlined I'm not
sure we have to do this just yet).
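
Just to make sure I'm reading the proposal right, it would be something
along these lines (purely a sketch, names are made up; the static call
is the same one bpf_get_branch_snapshot() uses internally):

  /* hypothetical per-attach-point snapshot buffer, e.g. in the link */
  struct lbr_snapshot {                           /* made-up name */
          struct perf_branch_entry entries[32];   /* 32 * 24 bytes */
          int cnt;
  };

  /* as early as possible in the fentry/kprobe attach code, before any
   * extra branching pollutes LBR and before the prog itself runs:
   */
  snap->cnt = static_call(perf_snapshot_branch_stack)(snap->entries, 32);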

> In the case of fentry it can be in bpf_trampoline or in bpf_link
> (I didn't analyze the pros/cons too far), and
> fentry will only do a single "call intel_pmu_snapshot_branch_stack"
> with that address.
> That's a trivial addition to arch_prepare_bpf_trampoline.
>
> Then bpf prog will take entries from link, since it has access to it.
>
> Same thing for kprobes. As soon as it's triggered it will
> call intel_pmu_snapshot_branch_stack.
> Should be simple to add.
>
> Recursion can overwrite that per-attach buffer, but
> lbr is screwed anyway if we recursed. So not a concern.
>
> > Note also __x64_sys_bpf+0x1a; it's an artifact of how fexit is
> > implemented: we call into the original function and it returns into
> > the trampoline. So this seems unavoidable as well without completely
> > changing how the trampoline works for fexit. Multi-kprobe,
> > conveniently, avoids this problem.
>
> Definitely do not want to redesign that to help retsnoop save an lbr entry.

Yep.

>
> > > Inlining bpf_get_branch_snapshot() may be ok,
> > > but the way you're doing it is not a clear win.
> > > Replacing size / 24 (which the compiler optimizes into a mul)
> > > with an actual divide, plus an extra mul, will be slower.
> > > div+mul vs mul is quite a difference.
> > > How noticeable that is is questionable,
> > > but from an inlining perspective it doesn't feel right to do
> > > "inline to avoid an extra frame" instead of
> > > "inline to improve performance".
> >
> > Yes, I saw this division-through-multiplication trick. I can
> > replicate that as well, but it will be pretty hard to understand, so I
> > thought it was probably not worth it. Note that
> > bpf_get_branch_snapshot() is not some sort of performance-critical
> > helper; if you are calling it on some super frequent kprobe/fentry,
> > you are already paying a high price just for copying 700+ bytes (and
> > probably a bunch of other stuff).
>
> div is the slowest instruction.
> On Skylake it takes 57 uops and 40-90 cycles, while mul takes 3 cycles
> and an L1 cache hit is 1-2.
> So it might be faster to copy 768 bytes than to do a single div.

Ok, I can add the div-through-multiplication code to keep it 1:1 with
the compiled helper code, no problem.
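
Concretely, for a u32 size that lowering is a widening multiply plus a
shift, so the inlined code can mirror it (a sketch; 0xaaaaaaab is the
usual reciprocal constant for dividing by 3, and /24 is /3 then /8):

  /* x / 24 == (x * 0xaaaaaaab) >> 36 for any u32 x:
   * 0xaaaaaaab is ceil(2^33 / 3), so multiply + shift by 33 does /3,
   * and the extra 3 bits of shift do the remaining /8.
   */
  static inline u32 div_by_24(u32 size)
  {
          return (u32)(((u64)size * 0xaaaaaaabULL) >> 36);
  }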

>
> I still think that adding a call to intel_pmu_snapshot_branch_stack
> in fentry and kprobes is a simpler and cleaner solution that eliminates
> all the guesswork of compiler inlining and optimizations.
>
> We can potentially optimize it further.
> Since arch_prepare_bpf_trampoline() is arch specific,
> for x86 we can inline:
>         local_irq_save(flags);
>         __intel_pmu_disable_all(false);
>         __intel_pmu_lbr_disable();
> into the generated trampoline (since the above is just 5-6 instructions)
> and call into __intel_pmu_snapshot_branch_stack.

Let's keep it as plan B, given this gets into gnarly internals of
Intel-specific x86-64 code.




