On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote: > > Incorrect accounting fwd_alloc can result in a warning when the socket > is torn down, > > [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230 > [...] > [18455.319543] Call Trace: > [18455.319556] inet_csk_destroy_sock+0xba/0x1f0 > [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380 > [18455.319593] ? lock_downgrade+0x3a0/0x3a0 > [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0 > [18455.319631] ? sk_reset_timer+0x15/0x70 > [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240 > [18455.319663] ? lock_release+0xb2/0x3f0 > [18455.319676] ? __release_sock+0x8a/0x1b0 > [18455.319690] ? lock_downgrade+0x3a0/0x3a0 > [18455.319704] ? lock_release+0x3f0/0x3f0 > [18455.319717] ? __tcp_close+0x2c6/0x790 > [18455.319736] ? tcp_v4_do_rcv+0x168/0x370 > [18455.319750] tcp_v4_do_rcv+0x168/0x370 > [18455.319767] __release_sock+0xbc/0x1b0 > [18455.319785] __tcp_close+0x2ee/0x790 > [18455.319805] tcp_close+0x20/0x80 > > This currently happens because on redirect case we do skb_set_owner_r() > with the original sock. This increments the fwd_alloc memory accounting > on the original sock. Then on redirect we may push this into the queue > of the psock we are redirecting to. When the skb is flushed from the > queue we give the memory back to the original sock. The problem is if > the original sock is destroyed/closed with skbs on another psocks queue > then the original sock will not have a way to reclaim the memory before > being destroyed. Then above warning will be thrown > > sockA sockB > > sk_psock_strp_read() > sk_psock_verdict_apply() > -- SK_REDIRECT -- > sk_psock_skb_redirect() > skb_queue_tail(psock_other->ingress_skb..) > > sk_close() > sock_map_unref() > sk_psock_put() > sk_psock_drop() > sk_psock_zap_ingress() > > At this point we have torn down our own psock, but have the outstanding > skb in psock_other. Note that SK_PASS doesn't have this problem because > the sk_psock_drop() logic releases the skb, its still associated with > our psock. > > To resolve lets only account for sockets on the ingress queue that are > still associated with the current socket. On the redirect case we will > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked > or received with sk_psock_skb_ingress memory will be claimed on psock_other. You mean sk_psock_skb_ingress(), right? > > Reported-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self") > Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx> > --- > net/core/skmsg.c | 13 ++++++------- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 1261512d6807..f150b5b63561 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb > if (unlikely(!msg)) > return -EAGAIN; > sk_msg_init(msg); > + skb_set_owner_r(skb, sk); > return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg); > } > > @@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int > { > switch (verdict) { > case __SK_REDIRECT: > - skb_set_owner_r(skb, sk); > sk_psock_skb_redirect(skb); > break; > case __SK_PASS: > @@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb) > rcu_read_lock(); > prog = READ_ONCE(psock->progs.skb_verdict); > if (likely(prog)) { > - /* We skip full set_owner_r here because if we do a SK_PASS > - * or SK_DROP we can skip skb memory accounting and use the > - * TLS context. > - */ > skb->sk = psock->sk; > tcp_skb_bpf_redirect_clear(skb); > ret = sk_psock_bpf_run(psock, prog, skb); > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) > kfree_skb(skb); > goto out; > } > - skb_set_owner_r(skb, sk); > prog = READ_ONCE(psock->progs.skb_verdict); > if (likely(prog)) { > + skb->sk = psock->sk; Why is skb_orphan() not needed here? Nit: You can just use 'sk' here, so "skb->sk = sk". > tcp_skb_bpf_redirect_clear(skb); > ret = sk_psock_bpf_run(psock, prog, skb); > ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb)); > + skb->sk = NULL; Why do you want to set it to NULL here? Thanks.