On Thu, Jun 9, 2022 at 8:08 AM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> >
> > This patch introduces tcp_read_skb() based on tcp_read_sock(),
> > as a preparation for the next patch, which actually introduces
> > a new sock ops.
> >
> > TCP is special here, because it has tcp_read_sock(), which is
> > mainly used by splice(). tcp_read_sock() supports partial reads
> > and arbitrary offsets, neither of which is needed for sockmap.
> >
> > Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> > Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > Cc: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
> > Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > ---
> >  include/net/tcp.h |  2 ++
> >  net/ipv4/tcp.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 49 insertions(+)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 1e99f5c61f84..878544d0f8f9 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -669,6 +669,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
> >  /* Read 'sendfile()'-style from a TCP socket */
> >  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> >  		  sk_read_actor_t recv_actor);
> > +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> > +		 sk_read_actor_t recv_actor);
> >
> >  void tcp_initialize_rcv_mss(struct sock *sk);
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 9984d23a7f3e..a18e9ababf54 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1709,6 +1709,53 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> >  }
> >  EXPORT_SYMBOL(tcp_read_sock);
> >
> > +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> > +		 sk_read_actor_t recv_actor)
> > +{
> > +	struct tcp_sock *tp = tcp_sk(sk);
> > +	u32 seq = tp->copied_seq;
> > +	struct sk_buff *skb;
> > +	int copied = 0;
> > +	u32 offset;
> > +
> > +	if (sk->sk_state == TCP_LISTEN)
> > +		return -ENOTCONN;
> > +
> > +	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> > +		int used;
> > +
> > +		__skb_unlink(skb, &sk->sk_receive_queue);
> > +		used = recv_actor(desc, skb, 0, skb->len);
> > +		if (used <= 0) {
> > +			if (!copied)
> > +				copied = used;
> > +			break;
> > +		}
> > +		seq += used;
> > +		copied += used;
> > +
> > +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > +			kfree_skb(skb);
>
> Hi Cong, can you elaborate here on the v2 comment?
>
> "Hm, it is tricky here, we use the skb refcount after this patchset, so
> it could be a real drop from another kfree_skb() in net/core/skmsg.c
> which initiates the drop."

Sure. This is the source code of consume_skb():

void consume_skb(struct sk_buff *skb)
{
	if (!skb_unref(skb))
		return;

	trace_consume_skb(skb);
	__kfree_skb(skb);
}

and this is kfree_skb() (or rather kfree_skb_reason()):

void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
{
	if (!skb_unref(skb))
		return;

	DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);

	trace_kfree_skb(skb, __builtin_return_address(0), reason);
	__kfree_skb(skb);
}

So, very clearly, both check the refcount before tracing.
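For reference, the skb_unref() that both of them call first looks
roughly like this (a paraphrased sketch of include/linux/skbuff.h, not
a verbatim copy; the exact details vary by kernel version):

/* Paraphrased sketch of skb_unref(): it returns true only for the
 * caller that drops the last reference, so the tracepoint in
 * consume_skb()/kfree_skb_reason() fires exactly once, at the final
 * put. */
static inline bool skb_unref(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return false;
	if (likely(refcount_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!refcount_dec_and_test(&skb->users)))
		return false;

	return true;
}

In other words, whichever of consume_skb()/kfree_skb() happens to do
the final put decides whether the event is traced as a consume or as a
drop, no matter who actually decided to drop the skb.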
Now, let's do a simple case:

tcp_read_skb():
 -> tcp_recv_skb() // Let's assume skb refcnt == 1 here
 -> recv_actor()
    -> skb_get() // refcnt == 2
    -> kfree_skb() // Let's assume users drop it intentionally
 -> kfree_skb() // refcnt == 0 here; if we had consume_skb() here,
                // it would not be counted as a drop

(A toy userspace model of this sequence is at the end of this mail.)

Of course you can construct another example where consume_skb() is the
correct choice, but the point here is that it is very tricky once the
skb refcount is involved. I even doubt we can do anything about it
here, except maybe moving the tracepoint before the refcount check.

>
> The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
> to kick tracing infra even on good cases with kfree_skb()? In
> sk_psock_verdict_recv() we do an skb_clone() there.

I don't get your point here. Are you suggesting we should sacrifice
performance just to make the drop tracing more accurate?

Thanks.
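P.S. Here is the toy model mentioned above: a plain userspace C
program (hypothetical names, not kernel code) that mimics the
refcounting in that sequence.

#include <stdbool.h>
#include <stdio.h>

/* Toy skb with just a reference count. */
struct toy_skb { int users; };

/* Mimics skb_unref(): true only for the final put. */
static bool toy_unref(struct toy_skb *skb)
{
	return --skb->users == 0;
}

static void toy_consume_skb(struct toy_skb *skb)
{
	if (!toy_unref(skb))
		return;
	printf("trace: consume_skb (not a drop)\n");
}

static void toy_kfree_skb(struct toy_skb *skb)
{
	if (!toy_unref(skb))
		return;
	printf("trace: kfree_skb (counted as a drop)\n");
}

int main(void)
{
	struct toy_skb skb = { .users = 1 };

	skb.users++;		/* skb_get() in recv_actor: refcnt == 2 */
	toy_kfree_skb(&skb);	/* user drops intentionally: refcnt == 1, no trace */
	toy_kfree_skb(&skb);	/* final put in tcp_read_skb(): refcnt == 0, traced as a drop */

	/* If the final put were toy_consume_skb() instead, the
	 * intentional drop above would never show up in drop tracing
	 * at all. */
	return 0;
}

It prints a single "trace: kfree_skb (counted as a drop)" line: only
the final put is traced, which is exactly why the choice between
kfree_skb() and consume_skb() at that last call site matters.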