Jiayuan Chen wrote: > On Tue, Dec 10, 2024 at 06:11:26PM -0800, John Fastabend wrote: > > Jiayuan Chen wrote: > > > 'sk->copied_seq' was updated in the tcp_eat_skb() function when the > > > action of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, > > > the update logic for 'sk->copied_seq' was moved to > > > tcp_bpf_recvmsg_parser() to ensure the accuracy of the 'fionread' feature. > > > > > > It works for a single stream_verdict scenario, as it also modified > > > 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb' > > > to remove updating 'sk->copied_seq'. > > > > > > However, for programs where both stream_parser and stream_verdict are > > > active(strparser purpose), tcp_read_sock() was used instead of > > > tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock) > > > tcp_read_sock() now still update 'sk->copied_seq', leading to duplicated > > > updates. > > > > > > In summary, for strparser + SK_PASS, copied_seq is redundantly calculated > > > in both tcp_read_sock() and tcp_bpf_recvmsg_parser(). > > > > > > The issue causes incorrect copied_seq calculations, which prevent > > > correct data reads from the recv() interface in user-land. > > > > > > Modifying tcp_read_sock() or strparser implementation directly is > > > unreasonable, as it is widely used in other modules. > > > > > > Here, we introduce a method tcp_bpf_read_sock() to replace > > > 'sk->sk_socket->ops->read_sock' (like 'tls_build_proto()' does in > > > tls_main.c). Such replacement action was also used in updating > > > tcp_bpf_prots in tcp_bpf.c, so it's not weird. > > > (Note that checkpatch.pl may complain missing 'const' qualifier when we > > > define the bpf-specified 'proto_ops', but we have to do because we need > > > update it). > > > > > > Also we remove strparser check in tcp_eat_skb() since we implement custom > > > function tcp_bpf_read_sock() without copied_seq updating. > > > > > > Since strparser currently supports only TCP, it's sufficient for 'ops' to > > > inherit inet_stream_ops. > > > > > > In strparser's implementation, regardless of partial or full reads, > > > it completely clones the entire skb, allowing us to unconditionally > > > free skb in tcp_bpf_read_sock(). > > > > > > Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq") > > > Signed-off-by: Jiayuan Chen <mrpre@xxxxxxx> > > > > [...] > > > > > +/* The tcp_bpf_read_sock() is an alternative implementation > > > + * of tcp_read_sock(), except that it does not update copied_seq. > > > + */ > > > +static int tcp_bpf_read_sock(struct sock *sk, read_descriptor_t *desc, > > > + sk_read_actor_t recv_actor) > > > +{ > > > + struct sk_buff *skb; > > > + int copied = 0; > > > + > > > + if (sk->sk_state == TCP_LISTEN) > > > + return -ENOTCONN; > > > + > > > + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { > > > + u8 tcp_flags; > > > + int used; > > > + > > > + WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); > > > + tcp_flags = TCP_SKB_CB(skb)->tcp_flags; > > > + used = recv_actor(desc, skb, 0, skb->len); > > > > Here the skb is still on the receive_queue how does this work with > > tcp_try_coalesce()? So I believe you need to unlink before you > > call the actor which creates a bit of trouble if recv_actor > > doesn't want the entire skb. > > > > I think easier is to do similar logic to read_sock and track > > offset and len? Did I miss something. > > Thanks to John Fastabend. > > Let me explain it. > Now I only replace the read_sock handler when using strparser. > > My previous implementation added the replacement of read_sock in > sk_psock_start_strp() to more explicitly require replacement when using > strparser, for instance: > '''skmsg.c > void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock) > { > ... > sk->sk_data_ready = sk_psock_strp_data_ready; > /* Replacement */ > sk->sk_socket->ops->read_sock = tcp_bpf_read_sock; > } > ''' > > As you can see that it only works for strparser. > (The current implementation of replacement in tcp_bpf_update_proto() > achieves the same effect just not as obviously.) > > So the current implementation of recv_actor() can only be strp_recv(), > with the call stack as follows: > ''' > sk_psock_strp_data_ready > -> strp_data_ready > -> strp_read_sock > -> strp_recv > ''' > > The implementation of strp_recv() will consume all input skb. Even if it > reads part of the data, it will clone it, then place it into its own > queue, expecting the caller to release the skb. I didn't find any > logic like tcp_try_coalesce() (fix me if i miss something). So here is what I believe the flow is, sk_psock_strp_data_ready -> str_data_ready -> strp_read_sock -> sock->ops->read_sock(.., strp_recv) We both have the same idea up to here. But then the proposed data_ready() call + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { + u8 tcp_flags; + int used; + + WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); + tcp_flags = TCP_SKB_CB(skb)->tcp_flags; + used = recv_actor(desc, skb, 0, skb->len); The recv_actor here is strp_recv() all good so far. But, because that skb is still on the sk_receive_queue() the TCP stack may at the same time do tcp_data_queue -> tcp_queue_rcv -> tail = skb_peek_tail(&sk->sk_receive_queue); tcp_try_coalesce(sk, tail, skb, fragstolen) -> skb_try_coalesce() ... skb->len += len So among other things you will have changed the skb->len and added some data to it. If this happens while you are running the recv actor we will eat the data by calling tcp_eat_recv_skb(). Any data added from the try_coalesce will just be dropped and never handled? The clone() from the strparser side doesn't help you the tcp_eat_recv_skb call will unlik the skb from the sk_receive_queue. I don't think you have any way to protect this at the moment. > > The record of the 'offset' is stored within its own context(strparser/_strp_msg). > With all skbs and offset saved in strp context, the caller does not need to > maintain it. > > I have also added various logic tests for this situation in the test > case, and it works correctly. > > > + /* strparser clone and consume all input skb > > > + * even in waiting head or body status > > > + */ > > > + tcp_eat_recv_skb(sk, skb); > > > + if (used <= 0) { > > > + if (!copied) > > > + copied = used; > > > + break; > > > + } > > > + copied += used; > > > + if (!desc->count) > > > + break; > > > + if (tcp_flags & TCPHDR_FIN) > > > + break; > > > + } > > > + return copied; > > > +} > > > + > > > enum { > > > TCP_BPF_IPV4, > > > TCP_BPF_IPV6, > > > @@ -595,6 +636,10 @@ enum { > >