Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work

Jason Xing <kerneljasonxing@xxxxxxxxx> · Thu, 6 Feb 2025 10:39:05 +0800

On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:
>
> On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >
> > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> > >>
> > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > >>>> +
> > >>>> +            tcb->txstamp_ack_bpf = 1;
> > >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> > >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > >>>> +    }
> > >>>
> > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> > >>> right?
> > >>
> > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> > >
> > > Martin, I'm afraid it's not like what you expect. Only the last
> > > portion of the sendmsg will enter the above function which means if
> > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > > and be timestamped.
> >
> > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
> >
> > My point is, only attaching a bpf alone is not enough. The
> > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.
>
> Right.
>
> >
> > >
> > >>
> > >>>
> > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > >>> interested in tracing current packet all the way thru the stack?
> > >>
> > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > >> particular socket.
> > >>
> > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > >> already has another usage, which its return value is currently enforced by the
> > >> verifier. It is better not to convolute it further.
> > >>
> > >> I don't prefer to add more use cases to skops->reply either, which is an union
> > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > >>
> > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> > >
> > > Agreed because at netdev 0x19 I have an explicit plan to share the
> > > experience from our company about how to trace all the skbs which were
> > > completed through a kernel module. It's how we use in production
> > > especially for debug or diagnose use.
> >
> > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > why move the bit setting into kfunc makes the whole set won't work.
> >
> > > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > > are some functions that I can take as good examples?
> > >
> > > I think it's a standalone and good feature, can I handle it after this series?
> >
> > Unfortunately, no. Once the default is on, this cannot be changed.
> >
> > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> > is useful, so I suggested a way to do it.
>
> Because, sorry, I don't want to postpone this series any longer (blame
> on me for delaying almost 4 months), only wanting to focus on the
> extension for SO_TIMESTAMPING so that we can quickly move on with
> small changes per series.
>
> Selectively sampling the skbs or sampling all the skbs could be an
> optional good choice/feature for users instead of mandatory?
>
> There are two kinds of monitoring in production: 1) daily monitoring,
> 2) diagnostic monitoring which I'm not sure if I translate in the
> right way. For the former that is obviously a light-weight feature, I
> think we don't need to trace that many skbs, only the last skb is
> enough which was done in Google because even the selective feature[1]
> is a little bit heavy. I received some complaints from a few
> latency-sensitive customers to ask us if we can reduce the monitoring
> in the kernel because as I mentioned before many issues are caused by
> the application itself instead of kernel.
>
> [1] selective feature consists of two parts, only selectively
> collecting all the skbs in a certain period or selectively collecting
> exactly like what SO_TIMESTAMPING does in a certain period. It might
> need a full discussion, I reckon.

I presume you might refer to the former. It works like the cmsg
feature which can be a good selectively sampling example. It would be
better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
callback which is nearly the very beginning of each sendmsg syscall
because I have a hunch we will add more hook points before skb enters
the qdisc.

I think we can split the whole idea into two parts: for now, because
of the current series implementing the same function as SO_TIMETAMPING
does, I will implement the selective sample feature in the series.
After someday we finish tracing all the skb, then we will add the
corresponding selective sample feature.

But the default mode is the exact same as SO_TIMESTAMPING instead of
asking bpf prog to enable the sample feature. Does it make sense to
you?

With that said, the patch looks like this:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1f528e63bc71..73909dad7ed4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -497,11 +497,14 @@ static void tcp_tx_timestamp(struct sock *sk,
struct sockcm_cookie *sockc)
            SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
                struct skb_shared_info *shinfo = skb_shinfo(skb);
                struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+               bool enable_sample = true;

-               tcb->txstamp_ack_bpf = 1;
-               shinfo->tx_flags |= SKBTX_BPF;
-               shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
-               bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
+               enable_sample = bpf_skops_tx_timestamping(sk, skb,
BPF_SOCK_OPS_TS_SND_CB);
+               if (enable_sample) {
+                       tcb->txstamp_ack_bpf = 1;
+                       shinfo->tx_flags |= SKBTX_BPF;
+                       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+               }
        }
 }

Thanks,
Jason