Re: [PATCH net-next v2 04/12] net-timestamp: add static key to control the whole bpf extension

Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> · Tue, 22 Oct 2024 20:17:38 -0400

Martin KaFai Lau wrote:
> On 10/20/24 2:51 PM, Willem de Bruijn wrote:
> > Jason Xing wrote:
> >> From: Jason Xing <kernelxing@xxxxxxxxxxx>
> >>
> >> Willem suggested that we use a static key to control. The advantage
> >> is that we will not affect the existing applications at all if we
> >> don't load BPF program.
> >>
> >> In this patch, except the static key, I also add one logic that is
> >> used to test if the socket has enabled its tsflags in order to
> >> support bpf logic to allow both cases to happen at the same time.
> >> Or else, the skb carring related timestamp flag doesn't know which
> >> way of printing is desirable.
> >>
> >> One thing important is this patch allows print from both applications
> >> and bpf program at the same time. Now we have three kinds of print:
> >> 1) only BPF program prints
> >> 2) only application program prints
> >> 3) both can print without side effect
> >>
> >> Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx>
> > 
> > Getting back to this thread. It is long, instead of responding to
> > multiple messages, let me combine them in a single response.
> > 
> > 
> > * On future extensions:
> > 
> > +1 that the UDP case, and datagrams more broadly, must have a clear
> > development path, before we can merge TCP.
> > 
> > Similarly, hardware timestamps need not be supported from the start,
> > but must clearly be supportable.
> > 
> > 
> > * On queueing packets to userspace:
> > 
> >>> the current behavior is to just queue to the sk_error_queue as long
> >>> as there is "SOF_TIMESTAMPING_TX_*" set in the skb's tx_flags and it
> >>> is regardless of the sk_tsflags. "
> > 
> >> Totally correct. SOF_TIMESTAMPING_SOFTWARE is a report flag while
> >> SOF_TIMESTAMPING_TX_* are generation flags. Without former, users can
> >> read the skb from the errqueue but are not able to parse the
> >> timestamps
> > 
> > Before queuing a packet to userspace on the error queue, the relevant
> > reporting flag is always tested. sock_recv_timestamp has:
> > 
> >          /*
> >           * generate control messages if
> >           * - receive time stamping in software requested
> >           * - software time stamp available and wanted
> >           * - hardware time stamps available and wanted
> >           */
> >          if (sock_flag(sk, SOCK_RCVTSTAMP) ||
> >              (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
> >              (kt && tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
> >              (hwtstamps->hwtstamp &&
> >               (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
> >                  __sock_recv_timestamp(msg, sk, skb);
> > 
> > Otherwise applications could get error messages queued, and
> > epoll/poll/select would unexpectedly behave differently.
> 
> I just tried the following diff to remove setsockopt from txtimestamp.c and run 
> "./txtimestamp -6 -c 1 -C -N -L ::1". It is getting the skb from the error queue 
> with only cmsg flag.

That it surprising and against the API intent as I understand it.
Let me reproduce and take a closer look.

> I did a printk in __skb_tstamp_tx to ensure the
> sk->sk_tsflags is empty also.
> 
> diff --git i/tools/testing/selftests/net/txtimestamp.c 
> w/tools/testing/selftests/net/txtimestamp.c
> index dae91eb97d69..5d9d2773b076 100644
> --- i/tools/testing/selftests/net/txtimestamp.c
> +++ w/tools/testing/selftests/net/txtimestamp.c
> @@ -319,6 +319,8 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int 
> payload_len)
>   	for (cm = CMSG_FIRSTHDR(msg);
>   	     cm && cm->cmsg_len;
>   	     cm = CMSG_NXTHDR(msg, cm)) {
> +		printf("cm->cmsg_level %d cm->cmsg_type %d\n",
> +		       cm->cmsg_level, cm->cmsg_type);
>   		if (cm->cmsg_level == SOL_SOCKET &&
>   		    cm->cmsg_type == SCM_TIMESTAMPING) {
>   			tss = (void *) CMSG_DATA(cm);
> @@ -362,7 +364,7 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int 
> payload_len)
>   	if (batch > 1) {
>   		fprintf(stderr, "batched %d timestamps\n", batch);
>   	} else if (!batch) {
> -		fprintf(stderr, "Failed to report timestamps\n");
> +		fprintf(stderr, "Failed to report timestamps. payload_len %d\n", payload_len);
>   		test_failed = true;
>   	}
>   }
> @@ -578,9 +580,12 @@ static void do_test(int family, unsigned int report_opt)
>   	if (cfg_loop_nodata)
>   		sock_opt |= SOF_TIMESTAMPING_OPT_TSONLY;
> 
> +	(void)sock_opt;
> +/*
>   	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
>   		       (char *) &sock_opt, sizeof(sock_opt)))
>   		error(1, 0, "setsockopt timestamping");
> +*/
> 
>   	for (i = 0; i < cfg_num_pkts; i++) {
>   		memset(&msg, 0, sizeof(msg));
> > 
> >> SOF_TIMESTAMPING_SOFTWARE is only used in traditional SO_TIMESTAMPING
> >> features including cmsg mode. But it will not be used in bpf mode.
> > 
> > For simplicity, the two uses of the API are best kept identical. If
> > there is a technical reason why BPF has to diverge from established
> > behavior, this needs to be explicitly called out in the commit
> > message.
> 
> SOF_TIMESTAMPING_OPT_TSONLY will not be supported. The orig_skb can always be 
> passed directly to the bpf if needed without extra cost. The same probably goes 
> for SOF_TIMESTAMPING_OPT_PKTINFO. SOF_TIMESTAMPING_SOFTWARE does not seem to be 
> useful either. I think only a subset of SOF_* will be supported, probably only 
> the TX_* and RX_* ones.
> 
> > 
> > Also, if you want to extend the API for BPF in the future, good to
> > call this out now and ideally extensions will apply to both, to
> > maintain a uniform API.
> > 
> > 
> > * On extra measurement points, at sendmsg or tcp_write_xmit:
> > 
> > The first is interesting. For application timestamping, this was
> > never needed, as the application can just call clock_gettime before
> > sendmsg.
> > 
> > In general, additional measurement points are not only useful if the
> > interval between is not constant. So far, we have seen no need for
> > any additional points.
> > 
> > 
> > * On skb state:
> > 
> >>> For now, is there thing we can explore to share in the skb_shared_info?
> > 
> > skb_shinfo space is at a premium. I don't think we can justify two
> > extra fields just for this use case.
> > 
> >> My initial thought is just to reuse these fields in skb. It can work
> >> without interfering one another.
> > 
> > I'm skeptical that two methods can work at the same time. If they are
> > started at different times, their sk_tskey will be different, for one.
> 
> For the skb's tx_flags, Jason seems to be able to figure out by only using the 
> new sk_tsflags_bpf. In the worst case, it seems there is still one bit left in 
> tx_flags.
> 
> I am also not very positive on the skb's tskey for now.
> 
> Willem, I recalled I had tried to reuse the tx_flags and hwtstamp when keeping 
> the delivery time in skb->tstamp for a skb redirecting from egress to ingress. I 
> think that approach was stalled because the tx_flags could be changed by the 
> netdevice like "skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS". How about the 
> skb_shinfo(skb)->hwtstamps? At least for the TX path, it should not be changed 
> until the netdevice calling skb_tstamp_tx() to report the hwtstamp? or the clone 
> in the tcp stack will still break things if the hwtstamps is reused for other 
> purpose?

True. I think on Tx hwtstamps is only used on the path from the driver
tx completion handler to when it calls skb_tstamp_tx.

It does not even really have to be an skb field. The first driver
cscope happens to point me to indeed just allocates it on the stack:
tsnep_tx_poll.

> > 
> > There may be workarounds. Maybe BPF can store its state in some BPF
> > specific field, indeed. Or perhaps it can store per-sk shadow state
> > that resolves the conflict. For instance, the offset between sk_tskey
> > and bpf_tskey.
> 
> I have also been proposing to explore other way for the key since bpf has direct 
> access to the skb (also the sk, bpf prog can store data in the sk).
> 
> The bpf prog can learn what is the seq_no of the egress-ing skb. When the ack 
> comes back, it can also learn the ack seq no. Does it help? It will be harder to 
> use because it probably needs to store this info in the bpf map (or in the bpf 
> sk storage). However, if it needs to learn the timestamp at the 
> tcp_sendmsg/tcp_transmit_skb/tcp_write_xmit, this timestamp has to be stored 
> somewhere also. Either in a bpf map or in a bpf sk storage.
> 
> SEC("cgroup/setsockopt") prog can also enforce the user space setsockopt. e.g. 
> it can add SOF_TIMESTAMPING_OPT_ID_TCP when user space only use 
> SOF_TIMESTAMPING_OPT_ID.