On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > >
> > > > > There are still ways to reduce the number of "wasted" records further; this is
> > > > > a problem that requires many small and rather independent steps.
> > > >
> > > > I feel this is the wrong path to follow.
> > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > to do perf_snapshot_branch_stack() as early as possible,
> > > > and then the bpf prog can copy these 16 or 32 8-byte entries at
> > > > its leisure.
> > >
> > > This is basically how Song started when he was adding this feature a
> > > few years ago. And I feel like we discussed this and decided that it
> > > would be cleaner to let the BPF program decide when (and whether) to
> > > get LBR, based on conditions. It still feels like the right tradeoff.
> >
> > Right, we discussed it back then, and at that time it was about
> > collecting stacks.
> > What's different now is that you want to collect all types of branches.

I was using --lbr=any and that's probably something to fix. I suspect
retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled. To figure
out the code path that leads to a return,
ARCH_LBR_JCC | ARCH_LBR_REL_CALL | ARCH_LBR_IND_CALL | ARCH_LBR_RETURN
might be good enough, and then there won't be a need to do inlining in
odd places just to avoid a tail jmp (see the filter sketch at the end of
this email).

> do this to get almost all the benefit. I just need to inline
> bpf_get_branch_snapshot().

If that is the only one that needs inlining, then fine, but I really
don't like to always_inline kprobe_multi_link_prog_run(). A day will go
by and somebody will send a patch to save 500 bytes of kernel .text by
removing the always_inline. The argument that it's there to help a user
space tool that wants to do lbr=all instead of excluding rel_jmp won't
look good.

> I don't think we can get rid of migrate_disable, we need to make sure
> we are freezing LBR on the CPU on which BPF program will run. So it's
> either preempt_disable or migrate_disable.

We cannot extend preempt_disable across the prog, and migrate_disable
won't really help, since another prog on the same cpu could do the same
"save lbr" action in a different hook and trash the per-cpu scratch
space. But we don't need either migrate_disable or preempt_disable. We
can have a 32*24 byte buffer per attach point. In the case of fentry it
can live in bpf_trampoline or in bpf_link (I didn't analyze the pros and
cons too far), and fentry will only do a single
"call intel_pmu_snapshot_branch_stack" with that address. That's a
trivial addition to arch_prepare_bpf_trampoline(). Then the bpf prog
will take the entries from the link, since it has access to it (see the
rough sketch at the end of this email). Same thing for kprobes: as soon
as one is triggered, it will call intel_pmu_snapshot_branch_stack.
Should be simple to add. Recursion can overwrite that per-attach buffer,
but lbr is screwed anyway if we recursed, so not a concern.

> Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
> implemented, we call into original function and it returns into
> trampoline. So this seems unavoidable as well without completely
> changing how trampoline works for fexit.
> Multi-kprobe actually, conveniently, avoids this problem.

Definitely do not want to redesign that to help retsnoop save an lbr
entry.

> > Inlining bpf_get_branch_snapshot() may be ok,
> > but the way you're doing it is not a clear win.
> > Replacing size / 24 (which the compiler optimizes into a mul)
> > with an actual divide and adding an extra mul will be slower.
> > div+mul vs mul is quite a difference.
> > How noticeable that is in practice is questionable,
> > but from an inlining perspective it doesn't feel right to do
> > "inline to avoid an extra frame" instead of
> > "inline to improve performance".
>
> Yes, I saw this division-through-multiplication trick. I can
> replicate that as well, but it will be pretty hard to understand, so I
> thought it is probably not worth it. Note that
> bpf_get_branch_snapshot() is not some sort of performance-critical
> helper; if you are calling it on some super frequent kprobe/fentry,
> you are paying a big price just for copying 700+ bytes (and
> probably a bunch of other stuff).

div is the slowest instruction. On Skylake it takes 57 uops and 40-90
cycles, while mul takes 3 cycles and an L1 cache hit is 1-2. So it might
be faster to copy 768 bytes than to do a single div (see the example at
the end of this email).

I still think that adding a call to intel_pmu_snapshot_branch_stack to
fentry and kprobes is a simpler and cleaner solution that eliminates all
the guesswork around compiler inlining and optimizations.

We can potentially optimize it further. Since
arch_prepare_bpf_trampoline() is arch specific, for x86 we can inline:

	local_irq_save(flags);
	__intel_pmu_disable_all(false);
	__intel_pmu_lbr_disable();

into the generated trampoline (the above is just 5-6 instructions) and
then call into __intel_pmu_snapshot_branch_stack.
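
To make the suggestions above a bit more concrete, a few rough, untested
sketches follow; they are illustrations, not proposed patches.

First, the branch filter point. A tool like retsnoop could ask for a
narrower filter through the usual perf_event_open() path instead of
PERF_SAMPLE_BRANCH_ANY (open_lbr_event() below is a made-up helper name
and the event setup is stripped down; the PERF_SAMPLE_BRANCH_* bits are
roughly the uapi counterparts of the ARCH_LBR_* bits above):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record conditional branches, calls and returns, but not unconditional
 * jumps. Real setup (sampling params, one event per CPU, error handling)
 * is left out.
 */
static int open_lbr_event(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
				  PERF_SAMPLE_BRANCH_COND |
				  PERF_SAMPLE_BRANCH_ANY_CALL |
				  PERF_SAMPLE_BRANCH_IND_CALL |
				  PERF_SAMPLE_BRANCH_ANY_RETURN;

	return syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
		       -1 /* group_fd */, 0 /* flags */);
}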
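
Second, the per-attach-point buffer, in plain C rather than generated
trampoline code and with made-up names (struct bpf_link_lbr_buf,
snapshot_lbr_into(), LBR_SNAPSHOT_ENTRIES); the only real pieces are
struct perf_branch_entry and the perf_snapshot_branch_stack static call
that bpf_get_branch_snapshot() already uses:

#include <linux/perf_event.h>
#include <linux/static_call.h>
#include <linux/types.h>

/* Hypothetical per-attach-point buffer: 32 entries * 24 bytes = 768 bytes,
 * embedded in bpf_link or bpf_trampoline.
 */
#define LBR_SNAPSHOT_ENTRIES	32

struct bpf_link_lbr_buf {
	u32 cnt;
	struct perf_branch_entry entries[LBR_SNAPSHOT_ENTRIES];
};

/* First thing the fentry trampoline (or the kprobe entry path) would do,
 * before any other kernel code pollutes the LBR.
 */
static __always_inline void snapshot_lbr_into(struct bpf_link_lbr_buf *buf)
{
	buf->cnt = static_call(perf_snapshot_branch_stack)(buf->entries,
							   LBR_SNAPSHOT_ENTRIES);
}

The bpf prog would then read the entries out of the link (or
bpf_get_branch_snapshot() could copy from this buffer) instead of
freezing the LBR at helper-call time.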
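
And on the size / 24 point, a tiny stand-alone illustration (nothing
kernel specific) of why the constant divisor matters:

/* Divisor is a compile-time constant (24 == sizeof(struct perf_branch_entry)),
 * so gcc/clang strength-reduce the division into a multiply plus shift;
 * no div instruction is emitted.
 */
unsigned int cnt_const(unsigned int size)
{
	return size / 24;
}

/* Divisor is only known at run time, so this one executes an actual div,
 * which is the 40-90 cycle case above.
 */
unsigned int cnt_runtime(unsigned int size, unsigned int entry_size)
{
	return size / entry_size;
}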