On Tue, Aug 24, 2021 at 09:03:20AM -0700, sdf@xxxxxxxxxx wrote: > On 08/23, Alexei Starovoitov wrote: > > On Mon, Aug 23, 2021 at 05:52:50PM -0400, Hans Montero wrote: > > > From: Hans Montero <hjm2133@xxxxxxxxxxxx> > > > > > > This patch set adds a BPF local storage optimization. The first patch > > adds the > > > feature, and the second patch extends the bpf selftests so that the > > feature is > > > tested. > > > > > > We are running BPF programs for each egress packet and noticed that > > > bpf_sk_storage_get incurs a significant amount of cpu time. By > > inlining the > > > storage into struct sock and accessing that instead of performing a > > map lookup, > > > we expect to reduce overhead for our specific use-case. > > > Looks like a hack to me. Please share the perf numbers and setup details. > > I think there should be a different way to address performance concerns > > without going into such hacks. > > What kind of perf numbers would you like to see? What we see here is > that bpf_sk_storage_get() cycles are somewhere on par with hashtable > lookups (we've moved off of 5-tuple ht lookup to sk_storage). Looking > at the code, it seems it's mostly coming from the following: > > sk_storage = rcu_dereference(sk->sk_bpf_storage); > sdata = rcu_dereference(local_storage->cache[smap->cache_idx]); > return sdata->data > > We do 3 cold-cache references :-( Only if the prog doesn't read anything at all through 'sk' pointer, but sounds like the bpf logic is accessing it, so for a system with millions of sockets the first access to 'sk' will pay that penalty. I suspect if you rearrange bpf prog to do sk->foo first the cache miss will move and bpf_sk_storage_get() won't be hot anymore. That's why optimizing loads like this without considering the full picture might not achieve the desired result at the end. > This is where the idea of inlining > something in the socket itself came from. The RFC is just to present > the case and discuss. I was thinking about doing some kind of > inlining at runtime (and fallback to non-inlined case) but wanted > to start with discussing this compile-time option first. > > We can also try to save sdata somewhere in the socket to avoid two > lookups for the cached case, this can potentially save us two > rcu_dereference's. > Is that something that looks acceptable? Not until there is clear 'perf report' where issue is obvious. > I was wondering whether you've > considered any socket storage optimizations on your side? Quite the opposite. We've refactored several bpf progs to use local storage instead of hash maps and achieved nice perf wins. > I can try to set up some office hours to discuss in person if that's > preferred. Indeed, it's probably the best do discuss it on a call.