Hello:

This patch was applied to bpf/bpf-next.git (master)
by Martin KaFai Lau <martin.lau@xxxxxxxxxx>:

On Wed, 7 Feb 2024 13:26:17 +0100 you wrote:
> In various performance profiles of kernels with BPF programs attached,
> bpf_local_storage_lookup() appears as a significant portion of CPU
> cycles spent. To enable the compiler to generate more optimal code, turn
> bpf_local_storage_lookup() into a static inline function, where only the
> cache insertion code path is outlined.
>
> Notably, outlining cache insertion helps avoid bloating callers by
> duplicating the setup of calls to raw_spin_lock_irqsave() and
> raw_spin_unlock_irqrestore() (on architectures which do not inline
> spin_lock/unlock, such as x86), which would cause the compiler to produce
> worse code by deciding to outline otherwise inlinable functions. The call
> overhead is neutral, because we make 2 calls either way: either calling
> raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(); or calling
> __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(),
> followed by a tail call to raw_spin_unlock_irqrestore(), where the
> compiler can perform TCO and (in optimized uninstrumented builds) turn it
> into a plain jump. The call to __bpf_local_storage_insert_cache() can be
> elided entirely if cacheit_lockit is a false constant expression.
>
> [...]

Here is the summary with links:
  - [bpf-next,v2] bpf: Allow compiler to inline most of bpf_local_storage_lookup()
    https://git.kernel.org/bpf/bpf-next/c/68bc61c26cac

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
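
For readers unfamiliar with the inline-fast-path/outlined-slow-path pattern the quoted commit message describes, below is a minimal illustrative sketch in plain C. It is not the actual bpf_local_storage code from the patch: the names, the single-slot cache, and the use of pthread mutexes (standing in for the kernel's raw spinlocks) are all hypothetical simplifications.

/*
 * Sketch only: the lookup fast path is a static inline function, while the
 * lock-protected cache-insertion slow path lives in a separate out-of-line
 * function so callers are not bloated with lock setup code.
 */
#include <pthread.h>
#include <stdio.h>

struct elem {
	int key;
	int value;
	struct elem *next;
};

struct storage {
	pthread_mutex_t lock;	/* stands in for the kernel's raw spinlock */
	struct elem *cache;	/* single-slot cache, checked without the lock */
	struct elem *list;	/* full list, walked on a cache miss */
};

/* Outlined slow path: only this function pays for the lock/unlock calls. */
void storage_insert_cache(struct storage *s, struct elem *e)
{
	pthread_mutex_lock(&s->lock);
	s->cache = e;
	pthread_mutex_unlock(&s->lock);	/* tail position: candidate for TCO */
}

/* Inlined fast path: small enough to be inlined into every call site. */
static inline struct elem *storage_lookup(struct storage *s, int key,
					  int cacheit_lockit)
{
	struct elem *e = s->cache;

	if (e && e->key == key)
		return e;		/* cache hit: no call, no lock */

	for (e = s->list; e; e = e->next)
		if (e->key == key)
			break;

	/* The slow-path call disappears when cacheit_lockit is constant false. */
	if (e && cacheit_lockit)
		storage_insert_cache(s, e);

	return e;
}

int main(void)
{
	struct elem a = { .key = 1, .value = 42, .next = NULL };
	struct storage s = { .lock = PTHREAD_MUTEX_INITIALIZER,
			     .cache = NULL, .list = &a };
	struct elem *e = storage_lookup(&s, 1, 1);

	printf("found value %d\n", e ? e->value : -1);
	return 0;
}

Because the lock and unlock calls appear only inside the outlined function, call sites that inline storage_lookup() stay small, and the compiler is free to turn the final unlock in the slow path into a plain jump, mirroring the reasoning in the commit message above.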