Re: [Patch bpf v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Thu, 1 Jul 2021 11:00:35 -0700

On Thu, Jul 1, 2021 at 9:23 AM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> >
> > Jiang observed OOM frequently when testing our AF_UNIX/UDP
> > proxy. This is due to the fact that we do not actually limit
> > the socket memory before queueing skb to ingress_skb. We
> > charge the skb memory later when handling the psock backlog,
> > but it is not limited either.
>
> Right, its not limiting but charging it should push back on
> the stack so it stops feeding skbs to us. Maybe this doesn't
> happen in UDP side?

The OOM is due to skb queued in ingress_skb, not due to
user-space consuming skb slowly.

>
> >
> > This patch adds checks for sk->sk_rcvbuf right before queuing
> > to ingress_skb and drops packets if this limit exceeds. This
> > is very similar to UDP receive path. Ideally we should set the
> > skb owner before this check too, but it is hard to make TCP
> > happy about sk_forward_alloc.
>
> But it breaks TCP side see below.
>
> >
> > Reported-by: Jiang Wang <jiang.wang@xxxxxxxxxxxxx>
> > Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> > Cc: Lorenz Bauer <lmb@xxxxxxxxxxxxxx>
> > Cc: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
> > Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > ---
> >  net/core/skmsg.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 9b6160a191f8..a5185c781332 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
> >               return -EIO;
> >       }
> >       spin_lock_bh(&psock_other->ingress_lock);
> > -     if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> > +     if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
> > +         atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
> >               spin_unlock_bh(&psock_other->ingress_lock);
> >               skb_bpf_redirect_clear(skb);
> >               sock_drop(from->sk, skb);
> > @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
> >               }
> >               if (err < 0) {
> >                       spin_lock_bh(&psock->ingress_lock);
> > -                     if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> > +                     if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
> > +                         atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
> >                               skb_queue_tail(&psock->ingress_skb, skb);
>
> We can't just drop the packet in the memory overrun case here. This will
> break TCP because the data will be gone and no one will retransmit.
>
> Thats why in the current scheme on redirect we can push back when we
> move it to the other queues ingress message queue or redirect into
> the other socket via send.
>
> At one point I considered charging the data sitting in the ingress_skb?
> Would that solve the problem here? I think it would cause the enqueue
> at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?

I tried to move skb_set_owner_r() here, TCP is clearly unhappy about it,
as I explained in changelog. Yes, it probably helps if we could move it here.

Thanks.