On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer <lmb@xxxxxxxxxxxxx> wrote: > > Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT > sockets. This means we can't use the helper to steer traffic to Envoy, which > configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing > TPROXY from our setup. > > The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup > helpers don't execute SK_REUSEPORT programs. Instead, one of the > reuseport sockets is selected by hash. This could cause dispatch to the > "wrong" socket: > > sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash > bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed > > Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup > helpers unfortunately. In the tc context, L2 headers are at the start > of the skb, while SK_REUSEPORT expects L3 headers instead. > > Instead, we execute the SK_REUSEPORT program when the assigned socket > is pulled out of the skb, further up the stack. This creates some > trickiness with regards to refcounting as bpf_sk_assign will put both > refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU > freed. We can infer that the sk_assigned socket is RCU freed if the > reuseport lookup succeeds, but convincing yourself of this fact isn't > straight forward. Therefore we defensively check refcounting on the > sk_assign sock even though it's probably not required in practice. > > Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign") > Fixes: cf7fbe6 ("bpf: Add socket assign support") > Co-developed-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > Signed-off-by: Lorenz Bauer <lmb@xxxxxxxxxxxxx> > Cc: Joe Stringer <joe@xxxxxxxxx> > Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@xxxxxxxxxxxxxx/ > --- > include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++----- > include/net/inet_hashtables.h | 27 +++++++++++++++++++++++-- > include/net/sock.h | 7 +++++-- > include/uapi/linux/bpf.h | 3 --- > net/core/filter.c | 2 -- > net/ipv4/inet_hashtables.c | 15 +++++++------- > net/ipv4/udp.c | 23 +++++++++++++++++++--- > net/ipv6/inet6_hashtables.c | 19 +++++++++--------- > net/ipv6/udp.c | 23 +++++++++++++++++++--- > tools/include/uapi/linux/bpf.h | 3 --- > 10 files changed, 119 insertions(+), 39 deletions(-) > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c > index e7391bf310a7..920131e4a65d 100644 > --- a/net/ipv4/inet_hashtables.c > +++ b/net/ipv4/inet_hashtables.c > @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net, > return score; > } > > -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk, > - struct sk_buff *skb, int doff, > - __be32 saddr, __be16 sport, > - __be32 daddr, unsigned short hnum) > +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk, > + struct sk_buff *skb, int doff, > + __be32 saddr, __be16 sport, > + __be32 daddr, unsigned short hnum) > { > struct sock *reuse_sk = NULL; > u32 phash; > @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk, > } > return reuse_sk; > } > +EXPORT_SYMBOL_GPL(inet_lookup_reuseport); > > /* > * Here are some nice properties to exploit here. The BSD API > @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net, > sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) { > score = compute_score(sk, net, hnum, daddr, dif, sdif); > if (score > hiscore) { > - result = lookup_reuseport(net, sk, skb, doff, > - saddr, sport, daddr, hnum); > + result = inet_lookup_reuseport(net, sk, skb, doff, > + saddr, sport, daddr, hnum); > if (result) > return result; > Please split in a series. First a patch renaming lookup_reuseport() to inet_lookup_reuseport() and inet6_lookup_reuseport() (cleanup, no change in behavior) This would ease review and future bug hunting quite a bit.