Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work

Jason Xing <kerneljasonxing@xxxxxxxxx> · Thu, 6 Feb 2025 09:05:20 +0800

On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 2/5/25 4:12 PM, Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>
> >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> >>>> +
> >>>> +            tcb->txstamp_ack_bpf = 1;
> >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> >>>> +    }
> >>>
> >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> >>> right?
> >>
> >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> >
> > Martin, I'm afraid it's not like what you expect. Only the last
> > portion of the sendmsg will enter the above function which means if
> > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > and be timestamped.
>
> Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
>
> My point is, only attaching a bpf alone is not enough. The
> SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.

Right.

>
> >
> >>
> >>>
> >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> >>> interested in tracing current packet all the way thru the stack?
> >>
> >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> >> particular socket.
> >>
> >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> >> already has another usage, which its return value is currently enforced by the
> >> verifier. It is better not to convolute it further.
> >>
> >> I don't prefer to add more use cases to skops->reply either, which is an union
> >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> >>
> >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> >
> > Agreed because at netdev 0x19 I have an explicit plan to share the
> > experience from our company about how to trace all the skbs which were
> > completed through a kernel module. It's how we use in production
> > especially for debug or diagnose use.
>
> This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> why move the bit setting into kfunc makes the whole set won't work.
>
> > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > are some functions that I can take as good examples?
> >
> > I think it's a standalone and good feature, can I handle it after this series?
>
> Unfortunately, no. Once the default is on, this cannot be changed.
>
> I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> is useful, so I suggested a way to do it.

Because, sorry, I don't want to postpone this series any longer (blame
on me for delaying almost 4 months), only wanting to focus on the
extension for SO_TIMESTAMPING so that we can quickly move on with
small changes per series.

Selectively sampling the skbs or sampling all the skbs could be an
optional good choice/feature for users instead of mandatory?

There are two kinds of monitoring in production: 1) daily monitoring,
2) diagnostic monitoring which I'm not sure if I translate in the
right way. For the former that is obviously a light-weight feature, I
think we don't need to trace that many skbs, only the last skb is
enough which was done in Google because even the selective feature[1]
is a little bit heavy. I received some complaints from a few
latency-sensitive customers to ask us if we can reduce the monitoring
in the kernel because as I mentioned before many issues are caused by
the application itself instead of kernel.

[1] selective feature consists of two parts, only selectively
collecting all the skbs in a certain period or selectively collecting
exactly like what SO_TIMESTAMPING does in a certain period. It might
need a full discussion, I reckon.

Thanks,
Jason