On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
> On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > A customer reported a request_socket leak in a Calico cloud environment. We
> > found that a BPF program was doing a socket lookup with takes a refcnt on
> > the socket and that it was finding the request_socket but returning the parent
> > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > 1st, resulting in request_sock slab object leak. This patch retains the

Great catch and debug indeed!

> > existing behaviour of returning full socks to the caller but it also decrements
> > the child request_socket if one is present before doing so to prevent the leak.
> >
> > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > thanks to Antoine Tenart for the reproducer and patch input.
> >
> > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")

Instead of the above commits, I think this dates back to
6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")

> > Tested-by: Curtis Taylor <cutaylor-pub@xxxxxxxxx>
> > Co-developed-by: Antoine Tenart <atenart@xxxxxxxxxx>
> > Signed-off-by:: Antoine Tenart <atenart@xxxxxxxxxx>
> > Signed-off-by: Jon Maxwell <jmaxwell37@xxxxxxxxx>
> > ---
> >  net/core/filter.c | 20 ++++++++++++++------
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 2e32cee2c469..e3c04ae7381f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> >  {
> >  	struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> >  					   ifindex, proto, netns_id, flags);
> > +	struct sock *sk1 = sk;
> >  	if (sk) {
> >  		sk = sk_to_full_sk(sk);
> > -		if (!sk_fullsock(sk)) {
> > -			sock_gen_put(sk);
> > +		/* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > +		 * sock refcnt is decremented to prevent a request_sock leak.
> > +		 */
> > +		if (!sk_fullsock(sk1))
> > +			sock_gen_put(sk1);
> > +		if (!sk_fullsock(sk))

In the timewait case, sk1 == sk. It is a bit worrying to pass sk to
sk_fullsock(sk) after the sock_gen_put() above, which may have dropped
the last reference. I think Daniel's 'if (sk2 != sk) { sock_gen_put(sk); }'
check is better.

>
> [ +Martin/Joe/Lorenz ]
>
> I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
> bpf_sk_release() case later on? Rough pseudocode could be something like below:
>
> static struct sock *
> __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> 		struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
> 		u64 flags)
> {
> 	struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> 					   ifindex, proto, netns_id, flags);
> 	if (sk) {
> 		struct sock *sk2 = sk_to_full_sk(sk);
>
> 		if (!sk_fullsock(sk2))
> 			sk2 = NULL;
> 		if (sk2 != sk) {
> 			sock_gen_put(sk);
> 			if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {

I don't think it matters whether the helper-returned sk2 is refcounted or
not (SOCK_RCU_FREE). The verifier already ensures that bpf_sk_lookup() and
bpf_sk_release() are always balanced, regardless of the type of sk2, and
bpf_sk_release() does the right thing: it checks whether sk2 is refcounted
before calling sock_gen_put().

The bug here is that the helper forgot to call sock_gen_put(sk), while the
verifier only tracks sk2. So I think the
'if (unlikely(...)) { WARN_ONCE(...); }' can be dropped.

> 				WARN_ONCE(1, "Found non-RCU, unreferenced socket!");
> 				sk2 = NULL;
> 			}
> 		}
> 		sk = sk2;
> 	}
> 	return sk;
> }