Patch "bpf, sockmap: TCP data stall on recv before accept" has been added to the 5.15-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Thu, 9 May 2024 13:40:37 -0400

This is a note to let you know that I've just added the patch titled

    bpf, sockmap: TCP data stall on recv before accept

to the 5.15-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     bpf-sockmap-tcp-data-stall-on-recv-before-accept.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 571b906d7d24a275334d64a47b43b62da302017a
Author: John Fastabend <john.fastabend@xxxxxxxxx>
Date:   Mon May 22 19:56:10 2023 -0700

    bpf, sockmap: TCP data stall on recv before accept
    
    [ Upstream commit ea444185a6bf7da4dd0df1598ee953e4f7174858 ]
    
    A common mechanism to put a TCP socket into the sockmap is to hook the
    BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
    that can map the socket info to the correct BPF verdict parser. When
    the user adds the socket to the map the psock is created and the new
    ops are assigned to ensure the verdict program will 'see' the sk_buffs
    as they arrive.
    
    Part of this process hooks the sk_data_ready op with a BPF specific
    handler to wake up the BPF verdict program when data is ready to read.
    The logic is simple enough (posted here for easy reading)
    
     static void sk_psock_verdict_data_ready(struct sock *sk)
     {
            struct socket *sock = sk->sk_socket;
    
            if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
                    return;
            sock->ops->read_skb(sk, sk_psock_verdict_recv);
     }
    
    The oversight here is sk->sk_socket is not assigned until the application
    accepts() the new socket. However, its entirely ok for the peer application
    to do a connect() followed immediately by sends. The socket on the receiver
    is sitting on the backlog queue of the listening socket until its accepted
    and the data is queued up. If the peer never accepts the socket or is slow
    it will eventually hit data limits and rate limit the session. But,
    important for BPF sockmap hooks when this data is received TCP stack does
    the sk_data_ready() call but the read_skb() for this data is never called
    because sk_socket is missing. The data sits on the sk_receive_queue.
    
    Then once the socket is accepted if we never receive more data from the
    peer there will be no further sk_data_ready calls and all the data
    is still on the sk_receive_queue(). Then user calls recvmsg after accept()
    and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
    The handler checks for data in the sk_msg ingress queue expecting that
    the BPF program has already run from the sk_data_ready hook and enqueued
    the data as needed. So we are stuck.
    
    To fix do an unlikely check in recvmsg handler for data on the
    sk_receive_queue and if it exists wake up data_ready. We have the sock
    locked in both read_skb and recvmsg so should avoid having multiple
    runners.
    
    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx>
    Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
    Reviewed-by: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@xxxxxxxxx
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e3a9477293ce4..89204004eeb5e 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -195,6 +195,26 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
 
 	lock_sock(sk);
+
+	/* We may have received data on the sk_receive_queue pre-accept and
+	 * then we can not use read_skb in this context because we haven't
+	 * assigned a sk_socket yet so have no link to the ops. The work-around
+	 * is to check the sk_receive_queue and in these cases read skbs off
+	 * queue again. The read_skb hook is not running at this point because
+	 * of lock_sock so we avoid having multiple runners in read_skb.
+	 */
+	if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) {
+		tcp_data_ready(sk);
+		/* This handles the ENOMEM errors if we both receive data
+		 * pre accept and are already under memory pressure. At least
+		 * let user know to retry.
+		 */
+		if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) {
+			copied = -EAGAIN;
+			goto out;
+		}
+	}
+
 msg_bytes_ready:
 	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
 	if (!copied) {