Re: [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Wed, 24 Mar 2021 17:44:24 -0700

On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Incorrect accounting fwd_alloc can result in a warning when the socket
> is torn down,
>
>  [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
>  [...]
>  [18455.319543] Call Trace:
>  [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
>  [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
>  [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
>  [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
>  [18455.319631]  ? sk_reset_timer+0x15/0x70
>  [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
>  [18455.319663]  ? lock_release+0xb2/0x3f0
>  [18455.319676]  ? __release_sock+0x8a/0x1b0
>  [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
>  [18455.319704]  ? lock_release+0x3f0/0x3f0
>  [18455.319717]  ? __tcp_close+0x2c6/0x790
>  [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
>  [18455.319750]  tcp_v4_do_rcv+0x168/0x370
>  [18455.319767]  __release_sock+0xbc/0x1b0
>  [18455.319785]  __tcp_close+0x2ee/0x790
>  [18455.319805]  tcp_close+0x20/0x80
>
> This currently happens because on redirect case we do skb_set_owner_r()
> with the original sock. This increments the fwd_alloc memory accounting
> on the original sock. Then on redirect we may push this into the queue
> of the psock we are redirecting to. When the skb is flushed from the
> queue we give the memory back to the original sock. The problem is if
> the original sock is destroyed/closed with skbs on another psocks queue
> then the original sock will not have a way to reclaim the memory before
> being destroyed. Then above warning will be thrown
>
>   sockA                          sockB
>
>   sk_psock_strp_read()
>    sk_psock_verdict_apply()
>      -- SK_REDIRECT --
>      sk_psock_skb_redirect()
>                                 skb_queue_tail(psock_other->ingress_skb..)
>
>   sk_close()
>    sock_map_unref()
>      sk_psock_put()
>        sk_psock_drop()
>          sk_psock_zap_ingress()
>
> At this point we have torn down our own psock, but have the outstanding
> skb in psock_other. Note that SK_PASS doesn't have this problem because
> the sk_psock_drop() logic releases the skb, its still associated with
> our psock.
>
> To resolve lets only account for sockets on the ingress queue that are
> still associated with the current socket. On the redirect case we will
> check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
> until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
> or received with sk_psock_skb_ingress memory will be claimed on psock_other.

You mean sk_psock_skb_ingress(), right?

>
> Reported-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
> Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx>
> ---
>  net/core/skmsg.c |   13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 1261512d6807..f150b5b63561 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>         if (unlikely(!msg))
>                 return -EAGAIN;
>         sk_msg_init(msg);
> +       skb_set_owner_r(skb, sk);
>         return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
>  }
>
> @@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int
>  {
>         switch (verdict) {
>         case __SK_REDIRECT:
> -               skb_set_owner_r(skb, sk);
>                 sk_psock_skb_redirect(skb);
>                 break;
>         case __SK_PASS:
> @@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
>         rcu_read_lock();
>         prog = READ_ONCE(psock->progs.skb_verdict);
>         if (likely(prog)) {
> -               /* We skip full set_owner_r here because if we do a SK_PASS
> -                * or SK_DROP we can skip skb memory accounting and use the
> -                * TLS context.
> -                */
>                 skb->sk = psock->sk;
>                 tcp_skb_bpf_redirect_clear(skb);
>                 ret = sk_psock_bpf_run(psock, prog, skb);
> @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
>                 kfree_skb(skb);
>                 goto out;
>         }
> -       skb_set_owner_r(skb, sk);
>         prog = READ_ONCE(psock->progs.skb_verdict);
>         if (likely(prog)) {
> +               skb->sk = psock->sk;

Why is skb_orphan() not needed here?

Nit: You can just use 'sk' here, so "skb->sk = sk".

>                 tcp_skb_bpf_redirect_clear(skb);
>                 ret = sk_psock_bpf_run(psock, prog, skb);
>                 ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
> +               skb->sk = NULL;

Why do you want to set it to NULL here?

Thanks.