Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage

John Fastabend <john.fastabend@xxxxxxxxx> · Thu, 21 Oct 2021 12:24:50 -0700

Jakub Sitnicki wrote:
> On Wed, Oct 20, 2021 at 05:51 PM CEST, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> >> > Jakub Sitnicki wrote:
> >> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> >> >> > We do not need to handle unhash from BPF side we can simply wait for the
> >> >> > close to happen. The original concern was a socket could transition from
> >> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
> >> >> > But, we convinced ourself this is no longer possible and we also
> >> >> > improved BPF sockmap to handle listen sockets so this is no longer a
> >> >> > problem.
> >> >> >
> >> >> > More importantly though there are cases where unhash is called when data is
> >> >> > in the receive queue. The BPF unhash logic will flush this data which is
> >> >> > wrong. To be correct it should keep the data in the receive queue and allow
> >> >> > a receiving application to continue reading the data. This may happen when
> >> >> > tcp_abort is received for example. Instead of complicating the logic in
> >> >> > unhash simply moving all this to tcp_close hook solves this.
> >> >> >
> >> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> >> >> > Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx>
> >> >> > ---
> >> >>
> >> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
> >> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> >> >> close it?
> >> >

[...]

 >> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
> >> a few things come to mind:
> >
> > I think what makes most sense is to do the minimal work to fix the
> > described issue for bpf tree without introducing new issues and
> > then do the consistency/better cases in bpf-next.
> >
> >>
> >> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
> >>    won't allow it. However, with this change we will be able to have a
> >>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
> >>    inserting TCP sockets in TCP_CLOSE state should be allowed for
> >>    consistency.
> >
> > I agree, but would hold off on this for bpf-next. I missed points
> > 2,3 though in this series.
> 
> OK, that makes sense.
> 
> >>
> >> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
> >>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
> >>    TCP_CLOSE state). Today we rely on the fact there that you can't
> >>    insert a TCP_CLOSE socket.
> >
> > This should be minimal change, just change the logic to allow only
> > TCP_LISTEN.
> >
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> >                 return -EINVAL;
> >         if (unlikely(sk && sk_is_refcounted(sk)))
> >                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> > -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> > +       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
> >                 return -ESOCKTNOSUPPORT; /* reject connected sockets */
> >
> >         /* Check if socket is suitable for packet L3/L4 protocol */
> >
> >
> 
> Yeah, it shouldn't be hard. But we need to cover UDP as well. Something
> along the lines of:
> 
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10402,8 +10402,10 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>                 return -EINVAL;
>         if (unlikely(sk && sk_is_refcounted(sk)))
>                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> -               return -ESOCKTNOSUPPORT; /* reject connected sockets */
> +       if (unlikely(sk && sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN))
> +               return -ESOCKTNOSUPPORT; /* reject closed TCP sockets */
> +       if (unlikely(sk && sk_is_udp(sk) && sk->sk_state != TCP_CLOSE))
> +               return -ESOCKTNOSUPPORT; /* reject connected UDP sockets */
> 
>         /* Check if socket is suitable for packet L3/L4 protocol */
>         if (sk && sk->sk_protocol != ctx->protocol)
> 
> We aren't testing today that that error case in sk_lookup test suite,
> because it wasn't possible to insert a TCP_CLOSE socket. So once that
> gets in, I can add coverage.
> 
> >>
> >> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
> >>    similar same case as with bpf_sk_lookup_assign() (with a slight
> >>    difference that reuseport allows dispatching to connected UDP
> >>    sockets).
> >
> > Is it needed here? There is no obvious check now.  Is ESTABLISHED
> > state OK here now?
> 
> TCP ESTABLISHED sockets are not okay. They can't join the reuseport
> group and will always hit the !reuse branch.
> 
> Re-reading the code, though, I think nothing needs to be done for the
> sk_select_reuseport() helper. TCP sockets will be detached from
> reuseport group on unhash. Hence TCP_CLOSE socket will also hit the
> !reuse branch.
> 
> CC'ing Martin just in case he wants to double-check.
> 
> >
> >>
> >> 4) Don't know exactly how checks in sockmap redirect helpers would need
> >>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
> >>    that's allowed due to a short window of opportunity that opens up
> >>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
> >>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
> >>    switched to TCP_ESTABLISHED.
> >>
> >>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
> >>    be nice to get an error from the redirect helper. If I understand
> >>    correctly, if the TCP stack drops the packet after BPF verdict has
> >>    selected a socket, only the socket owner will know about by reading
> >>    the error queue.
> >>
> >>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
> >>    either, but we don't seem to filter it out today, so the helper is
> >>    not airtight.
> >
> > Right. At the moment for sending we call do_tcp_sendpages() and this
> > has the normal check ~(TCPF_ESABLISHED | TCPF_CLOSE_WAIT) so we
> > would return an error. The missing case is ingress. We currently
> > let these happen and would need a check there. I was thinking
> > of doing it in a separate patch, but could tack it on to this
> > series for completeness.
> >
> 
> Oh, yeah, right. I see now what you mean. No problem on egress.
> 
> So it's just an SK_DROP return code from bpf_sk_redirect_map() that
> could be a potential improvement.
> 
> Your call if you want to add it this series. Patching it up as a follow
> up works for me as well.
> 
> >>
> >> All in all, sounds like an API change when it comes to corner cases, in
> >> addition to being a fix for the receive queue flush issue which you
> >> explained in the patch description. If possible, would push it through
> >> bpf-next.
> >
> > I think if we address 2,3,4 then we can fix the described issue
> > without introducing new cases. And then 1 is great for consistency
> > but can go via bpf-next?
> 
> So (3) is out, reuseport+sockmap users should be unaffected by this.
> 
> If you could patch (2) that would be great. We rely on this, and I can't
> assume that nobody isn't disconnecting their listener sockets for some
> reason.

Yep I'll roll a new version with a fix for this and leave the rest for
a follow up.

> 
> (4) and (1) can follow later, if you ask me.

Agree.