Re: [PATCH] bpf: Separate bpf_local_storage_lookup() fast and slow paths

Marco Elver <elver@xxxxxxxxxx> · Wed, 7 Feb 2024 10:56:23 +0100

On Tue, Feb 06, 2024 at 05:22PM -0800, Martin KaFai Lau wrote:
> On 2/6/24 9:04 AM, Marco Elver wrote:
> > On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote:
> > [...]
> > > > Or can you suggest different functions to hook to for the recursion test?
> > > 
> > > I don't prefer to add another tracepoint for the selftest.
> > 
> > Ok - I also checked, even though it should be a no-op, it wasn't
> > (compiler generated worse code).
> 
> I am interested to how the tracepoint generates worse code. Can you share
> some details ?

My guess is that it produces enough code that some inlinable functions
are no longer being inlined. Specifically __bpf_task_storage_get().

> > 
> > > The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
> > > initial bpf_local_storage_lookup() should work and the immediate recurred
> > > bpf_task_storage_delete() will fail.
> > > 
> > > Depends on how the new slow path function will look like in v2. The test can
> > > probably be made to go through the slow path, e.g. by creating a lot of task
> > > storage maps before triggering the lookup.
[...]
> > Could you suggest how we can fix up the tests? I'm a little stuck
> > because there's not much we can hook to left.
> 
> I don't see a solution either if only the cache insertion code path is in a
> traceable function.
> 
> The prog->active counter has already been covered in another test. This test
> is mostly only covering the lookup => delete recur case and the code path is
> contained within the bpf storage logic. The future code review should be
> able to cover. I would make an exception here and remove this test case
> considering anything (e.g. tracepoint) we do here is likely to make it
> worse. (more on the test removal below).
> 
> > 
> > Thanks,
> > -- Marco
> > 
> > ------ >8 ------
> > 
> > From: Marco Elver <elver@xxxxxxxxxx>
> > Date: Tue, 30 Jan 2024 17:57:45 +0100
> > Subject: [PATCH v2] bpf: Allow compiler to inline most of
> >   bpf_local_storage_lookup()
> > 
> > In various performance profiles of kernels with BPF programs attached,
> > bpf_local_storage_lookup() appears as a significant portion of CPU
> > cycles spent. To enable the compiler generate more optimal code, turn
> > bpf_local_storage_lookup() into a static inline function, where only the
> > cache insertion code path is outlined (call instruction can be elided
> > entirely if cacheit_lockit is a constant expression).
> 
> Can you share more why only putting the cache insertion code to a function
> improves the larger number of maps case. In the benchmark, cacheit_lockit
> should always be true and __bpf_local_storage_insert_cache() should always
> be called.

Keeping bpf_local_storage_lookup() smaller (even if just outlining the
cache insertion) makes a difference as it allows the compiler generate
more optimal code, specifically we avoid duplicating setting up calls to
_raw_spin_lock/unlock. E.g.  __bpf_task_storage_get is not being inlined
anymore if bpf_local_storage_lookup() becomes too large (i.e. everything
is up for inlining incl. cache insertion).

Also, on x86 preempt builds, spin_lock/unlock aren't inlinable, so we
have to pay the price of 2 calls regardless: previously for calls to
_raw_spin_lock_irqsave and to _raw_spin_unlock_irqsave. However, with
the version of __bpf_local_storage_insert_cache in my patch, the call to
_raw_spin_unlock_irqsave is tail called, which allows the compiler to
perform TCO, i.e. we still only pay the price of 2 calls: one to
__bpf_local_storage_insert_cache and to _raw_spin_lock_irqsave (but no
call to _raw_spin_unlock_irqsave, which can just be jumped to):

<__bpf_local_storage_insert_cache>:
  endbr64
  nopl   0x0(%rax,%rax,1)
  push   %r15
  push   %r14
  push   %r12
  push   %rbx
  mov    %rdx,%rbx
  mov    %rsi,%r12
  mov    %rdi,%r15
  lea    0xa8(%rdi),%r14
  mov    %r14,%rdi
  call   ffffffff82323650 <_raw_spin_lock_irqsave>
  cmpq   $0x0,0x18(%rbx)
  je     ffffffff8127ea80 <__bpf_local_storage_insert_cache+0x40>
  add    $0x40,%rbx
  movzwl 0x10e(%r12),%ecx

  mov    %rbx,(%r15,%rcx,8)
  mov    %r14,%rdi
  mov    %rax,%rsi
  pop    %rbx
  pop    %r12
  pop    %r14
  pop    %r15
  jmp    ffffffff823237d0 <_raw_spin_unlock_irqrestore>        <--- TCO

I also compared a version where _everything_ is inlined vs. the one with
__bpf_local_storage_insert_cache outlined: the one where everything is
inlined nullifies any performance improvements and is significantly
worse than the one with __bpf_local_storage_insert_cache outlined.

[...]
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????") >   int BPF_PROG(on_lookup)
> 
> Remove this BPF_PROG.
> 
> >   {
> >   	struct task_struct *task = bpf_get_current_task_btf();
> > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > index 4542dc683b44..d73b33a4c153 100644
> > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > @@ -27,7 +27,7 @@ struct {
> >   	__type(value, long);
> >   } map_b SEC(".maps");
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????")
> 
> Same here. The checks related to on_lookup in
> prog_tests/task_local_storage.c need to be removed also.
> 
> >   int BPF_PROG(on_lookup)
> >   {
> >   	struct task_struct *task = bpf_get_current_task_btf();
> 

Thanks,
-- Marco