Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers

On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > >
> > > There are still ways to reduce the number of "wasted" records further; this is
> > > a problem that requires many small and rather independent steps.
> >
> > I feel this is a wrong path to follow.
> > I think it would be better to introduce a flag for kprobe/fentry
> > to do perf_snapshot_branch_stack() as early as possible
> > and then bpf prog can copy these 16 or 32 8-byte entries at
> > its leisure.
>
> This is basically how Song started when he was adding this feature a
> few years ago. And I feel like we discussed this and decided that it
> would be cleaner to let the BPF program decide when (and whether) to
> get LBR, based on conditions. It still feels like the right tradeoff.

Right, we discussed it back then, and at that time it was about
collecting stacks.
What's different now is that you want to collect all types of branches
in retsnoop, including plain 'jmp pc+10' and conditional jmps.
This is not something that C code can control.
always_inline in C and inlining by the verifier reduce call frames,
but they may have both positive and negative effects when
all branches are collected.
Hence __always_inline in kprobe_multi_link_prog_run()
is a leap of faith with the assumption that the compiler won't
add jmps before calling into the prog,
but lots of different compiler flags add instrumentation:
kasan, stack protector, security mitigations that count call depth, etc.
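
To make that concrete, here is a minimal userspace sketch (not kernel
code, names made up). Built with something like
gcc -O2 -fstack-protector-strong, the inlined body still produces a
canary store plus a cmp and a conditional jmp in the caller, which is
exactly the kind of branch that eats an LBR entry once all branch types
are recorded:

  #include <stdio.h>
  #include <string.h>

  #define __always_inline inline __attribute__((__always_inline__))

  /* stand-in for a kprobe_multi_link_prog_run()-style wrapper */
  static __always_inline void run_prog(const char *name)
  {
          char scratch[64];

          /* the array makes -fstack-protector-strong add a canary store
           * and a cmp + conditional jmp to the enclosing frame, even
           * though run_prog() itself leaves no call frame behind */
          strncpy(scratch, name, sizeof(scratch) - 1);
          scratch[sizeof(scratch) - 1] = '\0';
          puts(scratch);
  }

  int main(void)
  {
          run_prog("no extra frame, but extra branches are still possible");
          return 0;
  }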


> Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
> that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
> doesn't have to happen so fast. BPF program can evaluate some
> conditions and grab LBR optionally, saving the overhead.
>
> With prog flag saying "kernel should capture LBR ASAP", we:

I was suggesting to use a per-attachment flag.
And kprobe is a lost cause.
I would do it for fentry only, where
we can generate a 'save LBR' call as the first thing in the bpf trampoline.

>   a) lose this flexibility to decide whether and when to grab LBR;
>   b) pay overhead regardless if LBR is ever actually used for any
> given prog invocation;

When retsnoop attaches a prog, the prog has to call that 'save LBR'
as soon as possible without any branches.
So a per-attach flag is not really a downside.
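
fwiw the 'grab it immediately' pattern is already expressible with the
existing helper; roughly like this (the attach point, map and prog names
below are made up for illustration):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MAX_LBR_ENTRIES 32

  struct lbr_buf {
          struct perf_branch_entry entries[MAX_LBR_ENTRIES];
  };

  /* per-CPU scratch space for the raw LBR entries (24 bytes each) */
  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct lbr_buf);
  } lbr_scratch SEC(".maps");

  SEC("fentry/__x64_sys_bpf")
  int BPF_PROG(grab_lbr)
  {
          __u32 zero = 0;
          struct lbr_buf *buf = bpf_map_lookup_elem(&lbr_scratch, &zero);
          long sz;

          if (!buf)
                  return 0;
          /* capture LBR before doing anything else so that as few
           * entries as possible are wasted on this prog's own branches */
          sz = bpf_get_branch_snapshot(buf->entries, sizeof(buf->entries), 0);
          if (sz <= 0)
                  return 0;
          /* ... evaluate conditions, emit sz / 24 entries if interesting ... */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";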

>   c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU
> buffer for something that is pretty niche (though hugely valuable
> when needed, of course);

I wouldn't worry about such a tiny buffer.

>   d) each program type that supports bpf_get_branch_snapshot() helper
> needs to implement this logic in their corresponding
> `bpf_prog_run_xxx()` running helpers, which is more than a few places.

I think a new kfunc that copies from the buffer will do.
Nothing needs to change.
Maybe bpf_get_branch_snapshot() can be made smart too,
but that is optional.
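
i.e. roughly something like this (a hypothetical sketch of a kernel-side
fragment; the kfunc name, the per-CPU buffer and the constant don't
exist today, they only illustrate "copy what the trampoline already
saved"; assumes the usual includes such as percpu.h, perf_event.h, btf.h):

  #define LBR_SNAPSHOT_MAX 32     /* assumed upper bound on LBR depth */

  struct lbr_snapshot {
          u32 cnt;                /* entries saved by the trampoline */
          struct perf_branch_entry entries[LBR_SNAPSHOT_MAX];
  };

  static DEFINE_PER_CPU(struct lbr_snapshot, bpf_lbr_snapshot);

  /* copy whatever the trampoline already snapshotted into the prog's buffer */
  __bpf_kfunc int bpf_copy_branch_snapshot(void *buf, u32 buf__sz)
  {
          struct lbr_snapshot *snap = this_cpu_ptr(&bpf_lbr_snapshot);
          u32 bytes = min_t(u32, buf__sz,
                            snap->cnt * sizeof(snap->entries[0]));

          memcpy(buf, snap->entries, bytes);
          return bytes;
  }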

> Now, let's see how much we can also realistically save with this approach.
>
> For fentry, we do save a few (2) entries, indeed. With changes in this
> patch we are at:
>
>   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
>   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
>   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
>   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
>   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
>   [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
>   [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
>   [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0
>
> With flag and kernel support, we'll be at something like
>
>   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
>   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
>   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
>   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
>   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
>   [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0

With the flag, migrate_disable and prog_enter* will be gone.
It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack.
If we try hard, we can inline
wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
Then it will be bpf_trampoline_ only, and
that's as minimal as it can get. One entry.
Hacking all over the kernel with inlining won't get anywhere close.

> For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings
> at all. Is it really worth it?

any_return is ok today, especially with fentry.

> While inlining bpf_get_branch_snapshot() does benefit only LBR use
> case, it's a rather typical BPF helper inlining procedure we do for a
> lot of helpers, so it's not exactly a hack or anything, just an
> optimization.

Inlining bpf_get_branch_snapshot() may be ok,
but the way you're doing it is not a clear win.
Replacing size / 24 (which the compiler optimizes into a mul)
with an actual divide and adding an extra mul will be slower.
div+mul vs mul is quite a difference.
How noticeable that is is questionable,
but from an inlining perspective it doesn't feel right to do
"inline to avoid an extra frame" instead of
"inline to improve performance".

> But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a
> pretty frequently used helper used to implement various
> BPF-program-specific per-CPU usages (recursion protection, temporary
> storage, or just plain replacing BPF_MAP_TYPE_PERCPU_ARRAY with global
> variables, which is already a slightly faster approach, and now will be
> even faster). And the implementation is well-contained.

Agree that inlining bpf_get_smp_processor_id() is a good thing,
but please do it cleanly so that per-cpu accessors can be
reused in other places.
I'll reply with details in the other thread.
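
For completeness, a sketch of the per-CPU global variable pattern
mentioned above (the attach point, array name and CPU bound are made up;
the power-of-two mask keeps the verifier happy about the array index):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define MAX_CPUS 256            /* assumed upper bound, power of two */

  static __u64 busy[MAX_CPUS];    /* plain global variable, one slot per CPU */

  SEC("kprobe/do_nanosleep")
  int handle(void *ctx)
  {
          __u32 cpu = bpf_get_smp_processor_id() & (MAX_CPUS - 1);

          if (busy[cpu])          /* already running on this CPU: bail out */
                  return 0;
          busy[cpu] = 1;

          /* ... recursion-protected work / per-CPU temporary storage ... */

          busy[cpu] = 0;
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";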




