Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work

Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> · Wed, 05 Feb 2025 21:56:18 -0500

Jason Xing wrote:
> On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:
> >
> > On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> > >
> > > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> > > >>
> > > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > > >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > > >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > > >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > > >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > > >>>> +
> > > >>>> +            tcb->txstamp_ack_bpf = 1;
> > > >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> > > >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > > >>>> +    }
> > > >>>
> > > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> > > >>> right?
> > > >>
> > > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> > > >
> > > > Martin, I'm afraid it's not like what you expect. Only the last
> > > > portion of the sendmsg will enter the above function which means if
> > > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > > > and be timestamped.
> > >
> > > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
> > >
> > > My point is, only attaching a bpf alone is not enough. The
> > > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.
> >
> > Right.
> >
> > >
> > > >
> > > >>
> > > >>>
> > > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > > >>> interested in tracing current packet all the way thru the stack?
> > > >>
> > > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > > >> particular socket.
> > > >>
> > > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > > >> already has another usage, which its return value is currently enforced by the
> > > >> verifier. It is better not to convolute it further.
> > > >>
> > > >> I don't prefer to add more use cases to skops->reply either, which is an union
> > > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > > >>
> > > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > > >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> > > >
> > > > Agreed because at netdev 0x19 I have an explicit plan to share the
> > > > experience from our company about how to trace all the skbs which were
> > > > completed through a kernel module. It's how we use in production
> > > > especially for debug or diagnose use.
> > >
> > > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > > why move the bit setting into kfunc makes the whole set won't work.
> > >
> > > > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > > > are some functions that I can take as good examples?
> > > >
> > > > I think it's a standalone and good feature, can I handle it after this series?
> > >
> > > Unfortunately, no. Once the default is on, this cannot be changed.
> > >
> > > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> > > is useful, so I suggested a way to do it.
> >
> > Because, sorry, I don't want to postpone this series any longer (blame
> > on me for delaying almost 4 months), only wanting to focus on the
> > extension for SO_TIMESTAMPING so that we can quickly move on with
> > small changes per series.
> >
> > Selectively sampling the skbs or sampling all the skbs could be an
> > optional good choice/feature for users instead of mandatory?
> >
> > There are two kinds of monitoring in production: 1) daily monitoring,
> > 2) diagnostic monitoring which I'm not sure if I translate in the
> > right way. For the former that is obviously a light-weight feature, I
> > think we don't need to trace that many skbs, only the last skb is
> > enough which was done in Google because even the selective feature[1]
> > is a little bit heavy. I received some complaints from a few
> > latency-sensitive customers to ask us if we can reduce the monitoring
> > in the kernel because as I mentioned before many issues are caused by
> > the application itself instead of kernel.
> >
> > [1] selective feature consists of two parts, only selectively
> > collecting all the skbs in a certain period or selectively collecting
> > exactly like what SO_TIMESTAMPING does in a certain period. It might
> > need a full discussion, I reckon.
> 
> I presume you might refer to the former. It works like the cmsg
> feature which can be a good selectively sampling example. It would be
> better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
> callback which is nearly the very beginning of each sendmsg syscall
> because I have a hunch we will add more hook points before skb enters
> the qdisc.
> 
> I think we can split the whole idea into two parts: for now, because
> of the current series implementing the same function as SO_TIMETAMPING
> does, I will implement the selective sample feature in the series.
> After someday we finish tracing all the skb, then we will add the
> corresponding selective sample feature.

Are you saying that you will include selective sampling now or want to
postpone it?

Jakub brought up a great point. Our continuous deployment would not be
feasible without sampling. Indeed implemented using cmsg.

I think it should be included from the initial patch series.

> But the default mode is the exact same as SO_TIMESTAMPING instead of
> asking bpf prog to enable the sample feature. Does it make sense to
> you?
> 
> With that said, the patch looks like this:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1f528e63bc71..73909dad7ed4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -497,11 +497,14 @@ static void tcp_tx_timestamp(struct sock *sk,
> struct sockcm_cookie *sockc)
>             SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
>                 struct skb_shared_info *shinfo = skb_shinfo(skb);
>                 struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> +               bool enable_sample = true;
> 
> -               tcb->txstamp_ack_bpf = 1;
> -               shinfo->tx_flags |= SKBTX_BPF;
> -               shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> -               bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
> +               enable_sample = bpf_skops_tx_timestamping(sk, skb,
> BPF_SOCK_OPS_TS_SND_CB);
> +               if (enable_sample) {
> +                       tcb->txstamp_ack_bpf = 1;
> +                       shinfo->tx_flags |= SKBTX_BPF;
> +                       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> +               }
>         }
>  }
> 
> Thanks,
> Jason