Jason Xing wrote: > On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote: > > > > On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > > > > > On 2/5/25 4:12 PM, Jason Xing wrote: > > > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > > >> > > > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote: > > > >>> On Wed, 5 Feb 2025 02:30:22 +0800 Jason Xing wrote: > > > >>>> + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) && > > > >>>> + SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) { > > > >>>> + struct skb_shared_info *shinfo = skb_shinfo(skb); > > > >>>> + struct tcp_skb_cb *tcb = TCP_SKB_CB(skb); > > > >>>> + > > > >>>> + tcb->txstamp_ack_bpf = 1; > > > >>>> + shinfo->tx_flags |= SKBTX_BPF; > > > >>>> + shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1; > > > >>>> + } > > > >>> > > > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this > > > >>> right? > > > >> > > > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING > > > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed. > > > > > > > > Martin, I'm afraid it's not like what you expect. Only the last > > > > portion of the sendmsg will enter the above function which means if > > > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF > > > > and be timestamped. > > > > > > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR). > > > > > > My point is, only attaching a bpf alone is not enough. The > > > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on. > > > > Right. > > > > > > > > > > > > >> > > > >>> > > > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's > > > >>> interested in tracing current packet all the way thru the stack? > > > >> > > > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a > > > >> particular socket. > > > >> > > > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value) > > > >> already has another usage, which its return value is currently enforced by the > > > >> verifier. It is better not to convolute it further. > > > >> > > > >> I don't prefer to add more use cases to skops->reply either, which is an union > > > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value. > > > >> > > > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a > > > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set > > > >> SKBTX_BPF and txstamp_ack_bpf in some skb. > > > > > > > > Agreed because at netdev 0x19 I have an explicit plan to share the > > > > experience from our company about how to trace all the skbs which were > > > > completed through a kernel module. It's how we use in production > > > > especially for debug or diagnose use. > > > > > > This is fine. The bpf prog can still do that by calling the kfunc. I don't see > > > why move the bit setting into kfunc makes the whole set won't work. > > > > > > > I'm not knowledgeable enough about BPF, so I'd like to know if there > > > > are some functions that I can take as good examples? > > > > > > > > I think it's a standalone and good feature, can I handle it after this series? > > > > > > Unfortunately, no. Once the default is on, this cannot be changed. > > > > > > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp > > > is useful, so I suggested a way to do it. > > > > Because, sorry, I don't want to postpone this series any longer (blame > > on me for delaying almost 4 months), only wanting to focus on the > > extension for SO_TIMESTAMPING so that we can quickly move on with > > small changes per series. > > > > Selectively sampling the skbs or sampling all the skbs could be an > > optional good choice/feature for users instead of mandatory? > > > > There are two kinds of monitoring in production: 1) daily monitoring, > > 2) diagnostic monitoring which I'm not sure if I translate in the > > right way. For the former that is obviously a light-weight feature, I > > think we don't need to trace that many skbs, only the last skb is > > enough which was done in Google because even the selective feature[1] > > is a little bit heavy. I received some complaints from a few > > latency-sensitive customers to ask us if we can reduce the monitoring > > in the kernel because as I mentioned before many issues are caused by > > the application itself instead of kernel. > > > > [1] selective feature consists of two parts, only selectively > > collecting all the skbs in a certain period or selectively collecting > > exactly like what SO_TIMESTAMPING does in a certain period. It might > > need a full discussion, I reckon. > > I presume you might refer to the former. It works like the cmsg > feature which can be a good selectively sampling example. It would be > better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB > callback which is nearly the very beginning of each sendmsg syscall > because I have a hunch we will add more hook points before skb enters > the qdisc. > > I think we can split the whole idea into two parts: for now, because > of the current series implementing the same function as SO_TIMETAMPING > does, I will implement the selective sample feature in the series. > After someday we finish tracing all the skb, then we will add the > corresponding selective sample feature. Are you saying that you will include selective sampling now or want to postpone it? Jakub brought up a great point. Our continuous deployment would not be feasible without sampling. Indeed implemented using cmsg. I think it should be included from the initial patch series. > But the default mode is the exact same as SO_TIMESTAMPING instead of > asking bpf prog to enable the sample feature. Does it make sense to > you? > > With that said, the patch looks like this: > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 1f528e63bc71..73909dad7ed4 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -497,11 +497,14 @@ static void tcp_tx_timestamp(struct sock *sk, > struct sockcm_cookie *sockc) > SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) { > struct skb_shared_info *shinfo = skb_shinfo(skb); > struct tcp_skb_cb *tcb = TCP_SKB_CB(skb); > + bool enable_sample = true; > > - tcb->txstamp_ack_bpf = 1; > - shinfo->tx_flags |= SKBTX_BPF; > - shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1; > - bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB); > + enable_sample = bpf_skops_tx_timestamping(sk, skb, > BPF_SOCK_OPS_TS_SND_CB); > + if (enable_sample) { > + tcb->txstamp_ack_bpf = 1; > + shinfo->tx_flags |= SKBTX_BPF; > + shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1; > + } > } > } > > Thanks, > Jason