On Tue, Feb 06, 2024 at 05:22PM -0800, Martin KaFai Lau wrote: > On 2/6/24 9:04 AM, Marco Elver wrote: > > On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote: > > [...] > > > > Or can you suggest different functions to hook to for the recursion test? > > > > > > I don't prefer to add another tracepoint for the selftest. > > > > Ok - I also checked, even though it should be a no-op, it wasn't > > (compiler generated worse code). > > I am interested to how the tracepoint generates worse code. Can you share > some details ? My guess is that it produces enough code that some inlinable functions are no longer being inlined. Specifically __bpf_task_storage_get(). > > > > > The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the > > > initial bpf_local_storage_lookup() should work and the immediate recurred > > > bpf_task_storage_delete() will fail. > > > > > > Depends on how the new slow path function will look like in v2. The test can > > > probably be made to go through the slow path, e.g. by creating a lot of task > > > storage maps before triggering the lookup. [...] > > Could you suggest how we can fix up the tests? I'm a little stuck > > because there's not much we can hook to left. > > I don't see a solution either if only the cache insertion code path is in a > traceable function. > > The prog->active counter has already been covered in another test. This test > is mostly only covering the lookup => delete recur case and the code path is > contained within the bpf storage logic. The future code review should be > able to cover. I would make an exception here and remove this test case > considering anything (e.g. tracepoint) we do here is likely to make it > worse. (more on the test removal below). > > > > > Thanks, > > -- Marco > > > > ------ >8 ------ > > > > From: Marco Elver <elver@xxxxxxxxxx> > > Date: Tue, 30 Jan 2024 17:57:45 +0100 > > Subject: [PATCH v2] bpf: Allow compiler to inline most of > > bpf_local_storage_lookup() > > > > In various performance profiles of kernels with BPF programs attached, > > bpf_local_storage_lookup() appears as a significant portion of CPU > > cycles spent. To enable the compiler generate more optimal code, turn > > bpf_local_storage_lookup() into a static inline function, where only the > > cache insertion code path is outlined (call instruction can be elided > > entirely if cacheit_lockit is a constant expression). > > Can you share more why only putting the cache insertion code to a function > improves the larger number of maps case. In the benchmark, cacheit_lockit > should always be true and __bpf_local_storage_insert_cache() should always > be called. Keeping bpf_local_storage_lookup() smaller (even if just outlining the cache insertion) makes a difference as it allows the compiler generate more optimal code, specifically we avoid duplicating setting up calls to _raw_spin_lock/unlock. E.g. __bpf_task_storage_get is not being inlined anymore if bpf_local_storage_lookup() becomes too large (i.e. everything is up for inlining incl. cache insertion). Also, on x86 preempt builds, spin_lock/unlock aren't inlinable, so we have to pay the price of 2 calls regardless: previously for calls to _raw_spin_lock_irqsave and to _raw_spin_unlock_irqsave. However, with the version of __bpf_local_storage_insert_cache in my patch, the call to _raw_spin_unlock_irqsave is tail called, which allows the compiler to perform TCO, i.e. we still only pay the price of 2 calls: one to __bpf_local_storage_insert_cache and to _raw_spin_lock_irqsave (but no call to _raw_spin_unlock_irqsave, which can just be jumped to): <__bpf_local_storage_insert_cache>: endbr64 nopl 0x0(%rax,%rax,1) push %r15 push %r14 push %r12 push %rbx mov %rdx,%rbx mov %rsi,%r12 mov %rdi,%r15 lea 0xa8(%rdi),%r14 mov %r14,%rdi call ffffffff82323650 <_raw_spin_lock_irqsave> cmpq $0x0,0x18(%rbx) je ffffffff8127ea80 <__bpf_local_storage_insert_cache+0x40> add $0x40,%rbx movzwl 0x10e(%r12),%ecx mov %rbx,(%r15,%rcx,8) mov %r14,%rdi mov %rax,%rsi pop %rbx pop %r12 pop %r14 pop %r15 jmp ffffffff823237d0 <_raw_spin_unlock_irqrestore> <--- TCO I also compared a version where _everything_ is inlined vs. the one with __bpf_local_storage_insert_cache outlined: the one where everything is inlined nullifies any performance improvements and is significantly worse than the one with __bpf_local_storage_insert_cache outlined. [...] > > -SEC("fentry/bpf_local_storage_lookup") > > +SEC("fentry/??????????????????????????") > int BPF_PROG(on_lookup) > > Remove this BPF_PROG. > > > { > > struct task_struct *task = bpf_get_current_task_btf(); > > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c > > index 4542dc683b44..d73b33a4c153 100644 > > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c > > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c > > @@ -27,7 +27,7 @@ struct { > > __type(value, long); > > } map_b SEC(".maps"); > > -SEC("fentry/bpf_local_storage_lookup") > > +SEC("fentry/??????????????????????????") > > Same here. The checks related to on_lookup in > prog_tests/task_local_storage.c need to be removed also. > > > int BPF_PROG(on_lookup) > > { > > struct task_struct *task = bpf_get_current_task_btf(); > Thanks, -- Marco