On 5/25/23 7:41 PM, Kuniyuki Iwashima wrote:
From: Lorenz Bauer <lmb@xxxxxxxxxxxxx>
Date: Thu, 25 May 2023 09:19:22 +0100
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy, which
configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
TPROXY from our setup.
The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
helpers don't execute SK_REUSEPORT programs. Instead, one of the
reuseport sockets is selected by hash. This could cause dispatch to the
"wrong" socket:
sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.
Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.
Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe6 ("bpf: Add socket assign support")
Please use 12 chars of hash.
$ cat ~/.gitconfig
[core]
abbrev = 12
[pretty]
fixes = Fixes: %h (\"%s\")
$ git show 8e368dc --pretty=fixes | head -n 1
Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Yeap, not quite sure what happened here but the 12 chars is clear. Will
be fixed up in v2, too, ofc.
Co-developed-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
Signed-off-by: Lorenz Bauer <lmb@xxxxxxxxxxxxx>
Cc: Joe Stringer <joe@xxxxxxxxx>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@xxxxxxxxxxxxxx/
---
include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
include/net/inet_hashtables.h | 27 +++++++++++++++++++++++--
include/net/sock.h | 7 +++++--
include/uapi/linux/bpf.h | 3 ---
net/core/filter.c | 2 --
net/ipv4/inet_hashtables.c | 15 +++++++-------
net/ipv4/udp.c | 23 +++++++++++++++++++---
net/ipv6/inet6_hashtables.c | 19 +++++++++---------
net/ipv6/udp.c | 23 +++++++++++++++++++---
tools/include/uapi/linux/bpf.h | 3 ---
10 files changed, 119 insertions(+), 39 deletions(-)
[...]
@@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
int iif, int sdif,
bool *refcounted)
{
- struct sock *sk = skb_steal_sock(skb, refcounted);
-
+ bool prefetched;
+ struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
+ struct net *net = dev_net(skb_dst(skb)->dev);
+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
nit: Reverse Xmas Tree order. Same for other chunks.
It is, the prefetched bool is simply used one line below. I don't think
this is much different than most other code from style pov..
+
+ if (prefetched) {
+ struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
+ &ip6h->saddr, sport,
+ &ip6h->daddr, ntohs(dport));
+ if (reuse_sk) {
+ if (reuse_sk != sk) {
+ if (*refcounted) {
+ sock_put(sk);
+ *refcounted = false;
+ }
+ if (IS_ERR(reuse_sk))
+ return NULL;
+ }
+ return reuse_sk;
+ }
Maybe we can add a hepler to avoid this duplication ?
We'll check if it can be made a bit nicer and integrate this into the v2.
Thanks,
Daniel