Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers

On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> >
> > There are still ways to reduce the number of "wasted" records further; this
> > is a problem that requires many small and rather independent steps.
>
> I feel this is a wrong path to follow.
> I think it would be better to introduce a flag for kprobe/fentry
> to do perf_snapshot_branch_stack() as early as possible
> and then bpf prog can copy these 16 or 32 8-byte entries at
> its leisure.

This is basically how Song started when he was adding this feature a
few years ago. And I feel like we discussed this and decided that it
would be cleaner to let the BPF program decide when (and whether) to
capture LBRs, based on runtime conditions. It still feels like the
right tradeoff.

Granted, for PERF_SAMPLE_BRANCH_ANY you have to capture it immediately
(and that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN
this doesn't have to happen right away. The BPF program can evaluate
some cheap conditions first and grab LBRs only when needed, saving the
overhead.
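
To make this concrete, here's a minimal sketch of that pattern (the
program name and the triggering condition are made up, purely for
illustration):

  /* minimal sketch; LBR post-processing omitted */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* 32 entries * 24 bytes = 768 bytes of global storage */
  struct perf_branch_entry lbrs[32];

  SEC("fexit/__x64_sys_bpf")
  int BPF_PROG(grab_lbrs_maybe, const struct pt_regs *regs, long ret)
  {
          long n;

          /* evaluate a cheap condition first; only pay for the LBR
           * copy when this invocation is actually interesting */
          if (ret >= 0)
                  return 0;

          n = bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);
          if (n < 0)
                  return 0;
          /* ... consume n bytes worth of LBR entries ... */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";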

With a prog flag saying "kernel should capture LBR ASAP", we:
  a) lose the flexibility to decide whether and when to grab LBRs;
  b) pay the overhead regardless of whether LBRs are actually used for
any given prog invocation;
  c) have to dedicate pretty large per-CPU buffers (32 entries * 24
bytes = 768 bytes each) for something that is pretty niche (though
hugely valuable when needed, of course);
  d) each program type that supports the bpf_get_branch_snapshot()
helper needs to implement this logic in its corresponding
`bpf_prog_run_xxx()` running helper, which is more than a few places
(see the hypothetical sketch right after this list).
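
Just for illustration, points (c) and (d) together would mean roughly
the following in every entry helper (the flag and buffer names below
are hypothetical; nothing like this exists today):

  /* hypothetical per-CPU buffer; this logic would need replicating
   * in each bpf_prog_run_xxx()-style entry helper */
  struct bpf_lbr_snapshot {
          struct perf_branch_entry entries[32]; /* 768 bytes */
          int cnt;
  };
  static DEFINE_PER_CPU(struct bpf_lbr_snapshot, bpf_lbr_snapshot);

  static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog,
                                            struct bpf_tramp_run_ctx *run_ctx)
  {
          if (prog->aux->capture_lbr_on_entry) { /* hypothetical flag */
                  struct bpf_lbr_snapshot *snap =
                          this_cpu_ptr(&bpf_lbr_snapshot);

                  /* capture as early as possible, before any more
                   * kernel branches pollute the LBR */
                  snap->cnt = static_call(perf_snapshot_branch_stack)(
                                  snap->entries, ARRAY_SIZE(snap->entries));
          }
          /* ... existing recursion accounting, migrate_disable(), etc ... */
  }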

Now, let's see how much we could realistically save with this approach.

For fentry, we do indeed save a couple of entries (2). With the changes
in this patch set we are at:

  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
  [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0

With the flag and kernel support, we'd be at something like:

  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0

So we get 2 extra LBRs at the expense of all those downsides I mentioned above.

But for kretprobe-multi it's even worse (we'd save just 1 record). With
the changes in this patch set, we are at:

  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3     ->  intel_pmu_snapshot_branch_stack+0x0

With custom flag support:

  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  intel_pmu_snapshot_branch_stack+0x0

We save just 1 extra LBR record.

For the PERF_SAMPLE_BRANCH_ANY_RETURN mode there would be no savings at
all. Is it really worth it?

Any other improvements (like somehow flattening the rethook call path)
will benefit both approaches equally.

> Hacking all over the kernel and requiring bpf prog to call
> bpf_get_branch_snapshot() in the first few instructions
> looks like self-inflicted pain.

While inlining bpf_get_branch_snapshot() only benefits the LBR use
case, it follows the rather typical BPF helper inlining procedure we
already apply to a lot of helpers, so it's not exactly a hack or
anything, just an optimization. The general shape of such inlining is
sketched below.
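
For reference, this is roughly what helper inlining looks like in the
verifier's do_misc_fixups(), using the long-standing bpf_jiffies64()
inlining as the example (simplified excerpt; the surrounding patching
loop is omitted):

  /* replace the helper call insn with an equivalent short sequence */
  if (insn->imm == BPF_FUNC_jiffies64) {
          struct bpf_insn ld_jiffies_addr[2] = {
                  BPF_LD_IMM64(BPF_REG_0, (unsigned long)&jiffies),
          };

          insn_buf[0] = ld_jiffies_addr[0];
          insn_buf[1] = ld_jiffies_addr[1];
          insn_buf[2] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
          cnt = 3;

          new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
          if (!new_prog)
                  return -ENOMEM;

          delta    += cnt - 1;
          env->prog = prog = new_prog;
          insn      = new_prog->insnsi + i + delta;
          continue;
  }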

But inlining bpf_get_smp_processor_id() goes way beyond LBR: it's a
pretty frequently used helper that implements various
BPF-program-specific per-CPU usage patterns (recursion protection,
temporary storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU
with global variables, which is already a bit faster and now will be
even faster; see the sketch below). And the implementation is
well-contained.
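
To illustrate that last pattern (all names below are made up), per-CPU
scratch state can live in a plain global array indexed by CPU id:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define MAX_CPUS 128 /* illustrative upper bound */

  struct scratch {
          __u64 depth;  /* e.g. recursion protection counter */
          char buf[64]; /* temporary storage */
  };

  /* a plain global (.bss) array instead of BPF_MAP_TYPE_ARRAY_PERCPU:
   * no map lookup on the hot path, just a CPU id and an index */
  struct scratch scratch[MAX_CPUS];

  static __always_inline struct scratch *this_cpu_scratch(void)
  {
          __u32 cpu = bpf_get_smp_processor_id();

          if (cpu >= MAX_CPUS) /* satisfies the verifier's bounds check */
                  return NULL;
          return &scratch[cpu];
  }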
