Cong Wang wrote:
> On Thu, Jun 9, 2022 at 8:08 AM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
> >
> > Cong Wang wrote:
> > > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > >
> > > This patch introduces tcp_read_skb() based on tcp_read_sock(),
> > > a preparation for the next patch which actually introduces
> > > a new sock ops.
> > >
> > > TCP is special here, because it has tcp_read_sock() which is
> > > mainly used by splice(). tcp_read_sock() supports partial read
> > > and arbitrary offset, neither of which is needed for sockmap.
> > >
> > > Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > > Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> > > Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > > Cc: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
> > > Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > ---
> > >  include/net/tcp.h |  2 ++
> > >  net/ipv4/tcp.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 49 insertions(+)
> > >
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 1e99f5c61f84..878544d0f8f9 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -669,6 +669,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
> > >  /* Read 'sendfile()'-style from a TCP socket */
> > >  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > >  		  sk_read_actor_t recv_actor);
> > > +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> > > +		 sk_read_actor_t recv_actor);
> > >
> > >  void tcp_initialize_rcv_mss(struct sock *sk);
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 9984d23a7f3e..a18e9ababf54 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -1709,6 +1709,53 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > >  }
> > >  EXPORT_SYMBOL(tcp_read_sock);
> > >
> > > +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> > > +		 sk_read_actor_t recv_actor)
> > > +{
> > > +	struct tcp_sock *tp = tcp_sk(sk);
> > > +	u32 seq = tp->copied_seq;
> > > +	struct sk_buff *skb;
> > > +	int copied = 0;
> > > +	u32 offset;
> > > +
> > > +	if (sk->sk_state == TCP_LISTEN)
> > > +		return -ENOTCONN;
> > > +
> > > +	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> > > +		int used;
> > > +
> > > +		__skb_unlink(skb, &sk->sk_receive_queue);
> > > +		used = recv_actor(desc, skb, 0, skb->len);
> > > +		if (used <= 0) {
> > > +			if (!copied)
> > > +				copied = used;
> > > +			break;
> > > +		}
> > > +		seq += used;
> > > +		copied += used;
> > > +
> > > +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > > +			kfree_skb(skb);
> >
> > Hi Cong, can you elaborate here on your comment from v2?
> >
> > "Hm, it is tricky here, we use the skb refcount after this patchset, so
> > it could be a real drop from another kfree_skb() in net/core/skmsg.c
> > which initiates the drop."
>
> Sure.
>
> This is the source code of consume_skb():
>
> void consume_skb(struct sk_buff *skb)
> {
> 	if (!skb_unref(skb))
> 		return;
>
> 	trace_consume_skb(skb);
> 	__kfree_skb(skb);
> }
>
> and this is kfree_skb() (or rather kfree_skb_reason()):
>
> void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
> {
> 	if (!skb_unref(skb))
> 		return;
>
> 	DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);
>
> 	trace_kfree_skb(skb, __builtin_return_address(0), reason);
> 	__kfree_skb(skb);
> }
>
> So, both do the skb_unref() check before tracing, very clearly.
>
> Now, let's do a simple case:
>
> tcp_read_skb():
>  -> tcp_recv_skb()   // Let's assume skb refcnt == 1 here
>  -> recv_actor()
>      -> skb_get()    // refcnt == 2
>      -> kfree_skb()  // Let's assume users drop it intentionally
>  -> kfree_skb()      // refcnt == 0 here; if we had consume_skb() it would
>                      // not be counted as a drop

OK, great, thanks for that. It matches what I was thinking as well.
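To make that concrete, here is a toy userspace model of the classification
problem. This is not kernel code: the refcnt variable and the toy_* names are
made up, and only the unref-before-trace control flow mirrors the
consume_skb()/kfree_skb_reason() sources quoted above. It shows that whichever
caller happens to drop the last reference decides how the free gets traced,
regardless of who intended the drop:

#include <stdio.h>

/* Toy stand-in for the skb refcount. */
static int refcnt;

/* Mirrors skb_unref(): only the caller that drops the last
 * reference proceeds to its trace point. */
static int toy_unref(void)
{
	return --refcnt == 0;
}

static void toy_consume_skb(void)
{
	if (!toy_unref())
		return;
	printf("trace_consume_skb: counted as a normal free\n");
}

static void toy_kfree_skb(void)
{
	if (!toy_unref())
		return;
	printf("trace_kfree_skb: counted as a DROP\n");
}

int main(void)
{
	/* Replay the trace above: tcp_read_skb() holds one reference
	 * and recv_actor() took a second via skb_get(). */
	refcnt = 2;

	toy_kfree_skb();  /* recv_actor's intentional drop: 2 -> 1, no trace fires */
	toy_kfree_skb();  /* tcp_read_skb()'s release: 1 -> 0, traced as a drop */

	/* Had the second call been toy_consume_skb(), the intentional
	 * drop above would have produced no drop event at all. */
	return 0;
}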
>
> Of course you can give another example where consume_skb() is
> correct, but the point here is that it is very tricky with refcounting. I
> even doubt we can do anything here, maybe moving the trace before the
> refcnt check.

Consider the other case, where we call kfree_skb() when consume_skb() would be
correct. We have logic in the Cilium tracing tools (Tetragon) to trace
kfree_skb()s and count them. So in the good case here we end up tripping that
logic even though it's expected. The question is which is better: a noisy
kfree_skb() even when expected, or a missing kfree_skb() on real drops. I'm
leaning toward consume_skb() being safer than a noisy kfree_skb().

>
> >
> > The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
> > to kick the tracing infra even on good cases with kfree_skb()? In
> > sk_psock_verdict_recv() we do an skb_clone() there.
>
> I don't get your point here, are you suggesting we should sacrifice
> performance just to make the drop tracing more accurate??

No, let's not sacrifice the performance. I'm suggesting I would rather go with
consume_skb() and miss some kfree_skb() events than the other way around and
have extra kfree_skb() events that trip monitoring. Does the question make
sense? I guess we have to pick one.

>
> Thanks.
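For the TCPHDR_FIN branch quoted above, the swap I have in mind is just the
following (untested, sketch only, and the same choice would apply to any other
site where the free is expected rather than a drop):

-			kfree_skb(skb);
+			consume_skb(skb);

That keeps the good path from firing trace_kfree_skb() and tripping drop
monitors, at the cost of missing the trace when the last release really was a
drop.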