Re: [RFC PATCH bpf-next 0/2] bpf: Implement shared persistent fast(er) sk_storoage mode

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 24 Aug 2021 15:15:22 -0700

On Tue, Aug 24, 2021 at 09:03:20AM -0700, sdf@xxxxxxxxxx wrote:
> On 08/23, Alexei Starovoitov wrote:
> > On Mon, Aug 23, 2021 at 05:52:50PM -0400, Hans Montero wrote:
> > > From: Hans Montero <hjm2133@xxxxxxxxxxxx>
> > >
> > > This patch set adds a BPF local storage optimization. The first patch
> > adds the
> > > feature, and the second patch extends the bpf selftests so that the
> > feature is
> > > tested.
> > >
> > > We are running BPF programs for each egress packet and noticed that
> > > bpf_sk_storage_get incurs a significant amount of cpu time. By
> > inlining the
> > > storage into struct sock and accessing that instead of performing a
> > map lookup,
> > > we expect to reduce overhead for our specific use-case.
> 
> > Looks like a hack to me. Please share the perf numbers and setup details.
> > I think there should be a different way to address performance concerns
> > without going into such hacks.
> 
> What kind of perf numbers would you like to see? What we see here is
> that bpf_sk_storage_get() cycles are somewhere on par with hashtable
> lookups (we've moved off of 5-tuple ht lookup to sk_storage). Looking
> at the code, it seems it's mostly coming from the following:
> 
>   sk_storage = rcu_dereference(sk->sk_bpf_storage);
>   sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
>   return sdata->data
> 
> We do 3 cold-cache references :-( 

Only if the prog doesn't read anything at all through 'sk' pointer,
but sounds like the bpf logic is accessing it, so for a system with millions
of sockets the first access to 'sk' will pay that penalty.
I suspect if you rearrange bpf prog to do sk->foo first the cache
miss will move and bpf_sk_storage_get() won't be hot anymore.
That's why optimizing loads like this without considering the full
picture might not achieve the desired result at the end.

> This is where the idea of inlining
> something in the socket itself came from. The RFC is just to present
> the case and discuss. I was thinking about doing some kind of
> inlining at runtime (and fallback to non-inlined case) but wanted
> to start with discussing this compile-time option first.
> 
> We can also try to save sdata somewhere in the socket to avoid two
> lookups for the cached case, this can potentially save us two
> rcu_dereference's.
> Is that something that looks acceptable? 

Not until there is clear 'perf report' where issue is obvious.

> I was wondering whether you've
> considered any socket storage optimizations on your side?

Quite the opposite. We've refactored several bpf progs to use
local storage instead of hash maps and achieved nice perf wins.

> I can try to set up some office hours to discuss in person if that's
> preferred.

Indeed, it's probably the best do discuss it on a call.