Re: [Patch bpf v3] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Wed, 13 Oct 2021 09:44:28 -0700

On Wed, Oct 13, 2021 at 7:07 AM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> >
> > Jiang observed OOM frequently when testing our AF_UNIX/UDP
> > proxy. This is due to the fact that we do not actually limit
> > the socket memory before queueing skb to ingress_skb. We
> > charge the skb memory later when handling the psock backlog,
> > and it is not limited either.
> >
> > This patch adds checks for sk->sk_rcvbuf right before queuing
> > to ingress_skb and drops or retries the packets if this limit
> > exceeds. This is very similar to UDP receive path. Ideally we
> > should set the skb owner before this check too, but it is hard
> > to make TCP happy with sk_forward_alloc.
> >
> > For TCP, we can not just drop the packets on errors. TCP ACKs
> > are already sent for those packet before reaching
> > ->sk_data_ready(). Instead, we use best effort to retry, this
> > works because TCP does not remove the skb from receive queue
> > at that point and exceeding sk_rcvbuf limit is a temporary
> > situation.
> >
> > Reported-by: Jiang Wang <jiang.wang@xxxxxxxxxxxxx>
> > Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> > Cc: Lorenz Bauer <lmb@xxxxxxxxxxxxxx>
> > Cc: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
> > Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
>
> Makes sense to include a fixes tag here.
>
> > ---
> > v3: add retry logic for TCP
> > v2: add READ_ONCE()
>
> I agree this logic is needed, but I think the below is not
> complete. I can get the couple extra fixes in front of this
> today/tomorrow on my side and kick it through some testing here.
> Then we should push it as a series. Your patch + additions.

Sounds good. As long as we have this limit, it will be okay to me.

>
> >
> >  net/core/skmsg.c | 15 +++++++++------
> >  net/ipv4/tcp.c   |  2 ++
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 2d6249b28928..356c314cd60c 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
>
> All the skmsg changes are good.
>
>
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index e8b48df73c85..8b243fcdbb8f 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1665,6 +1665,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> >                       if (used <= 0) {
> >                               if (!copied)
> >                                       copied = used;
> > +                             if (used == -EAGAIN)
> > +                                     continue;
>
> This is not a good idea, looping through read_sock because we have
> hit a memory limit is not going to work. If something is holding the
> memlimit pinned this could loop indefinately.
>
> Also this will run the verdict multiple times on the same bytes. For
> apply/cork logic this will break plus just basic parsers will be
> confused when they see duplicate bytes.

Good point! I run out of ideas for dealing with this TCP case,
dropping is not okay, retrying is hard, reworking TCP ACKing
is even harder. :-/

>
> We need to do a workqueue and then retry later.
>
> Final missing piece is that strparser logic would still not handle
> this correctly.
>
> I don't mind spending some time on this today. I'll apply your
> patch and then add a few fixes for above.

Ideally, we should move TCP ACK after ->sk_data_ready()
so that dropping in ->sk_data_ready() would be fine, but this is
certainly not easy even if it is doable.

Thanks.