Re: [PATCH net-next v2 04/12] net-timestamp: add static key to control the whole bpf extension

Jason Xing <kerneljasonxing@xxxxxxxxx> · Tue, 22 Oct 2024 10:30:42 +0800

On Tue, Oct 22, 2024 at 8:53 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 10/20/24 2:51 PM, Willem de Bruijn wrote:
> > Jason Xing wrote:
> >> From: Jason Xing <kernelxing@xxxxxxxxxxx>
> >>
> >> Willem suggested that we use a static key to control. The advantage
> >> is that we will not affect the existing applications at all if we
> >> don't load BPF program.
> >>
> >> In this patch, except the static key, I also add one logic that is
> >> used to test if the socket has enabled its tsflags in order to
> >> support bpf logic to allow both cases to happen at the same time.
> >> Or else, the skb carring related timestamp flag doesn't know which
> >> way of printing is desirable.
> >>
> >> One thing important is this patch allows print from both applications
> >> and bpf program at the same time. Now we have three kinds of print:
> >> 1) only BPF program prints
> >> 2) only application program prints
> >> 3) both can print without side effect
> >>
> >> Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx>
> >
> > Getting back to this thread. It is long, instead of responding to
> > multiple messages, let me combine them in a single response.
> >
> >
> > * On future extensions:
> >
> > +1 that the UDP case, and datagrams more broadly, must have a clear
> > development path, before we can merge TCP.
> >
> > Similarly, hardware timestamps need not be supported from the start,
> > but must clearly be supportable.
> >
> >
> > * On queueing packets to userspace:
> >
> >>> the current behavior is to just queue to the sk_error_queue as long
> >>> as there is "SOF_TIMESTAMPING_TX_*" set in the skb's tx_flags and it
> >>> is regardless of the sk_tsflags. "
> >
> >> Totally correct. SOF_TIMESTAMPING_SOFTWARE is a report flag while
> >> SOF_TIMESTAMPING_TX_* are generation flags. Without former, users can
> >> read the skb from the errqueue but are not able to parse the
> >> timestamps
> >
> > Before queuing a packet to userspace on the error queue, the relevant
> > reporting flag is always tested. sock_recv_timestamp has:
> >
> >          /*
> >           * generate control messages if
> >           * - receive time stamping in software requested
> >           * - software time stamp available and wanted
> >           * - hardware time stamps available and wanted
> >           */
> >          if (sock_flag(sk, SOCK_RCVTSTAMP) ||
> >              (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
> >              (kt && tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
> >              (hwtstamps->hwtstamp &&
> >               (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
> >                  __sock_recv_timestamp(msg, sk, skb);
> >
> > Otherwise applications could get error messages queued, and
> > epoll/poll/select would unexpectedly behave differently.
>
> I just tried the following diff to remove setsockopt from txtimestamp.c and run
> "./txtimestamp -6 -c 1 -C -N -L ::1". It is getting the skb from the error queue
> with only cmsg flag. I did a printk in __skb_tstamp_tx to ensure the
> sk->sk_tsflags is empty also.
>
> diff --git i/tools/testing/selftests/net/txtimestamp.c
> w/tools/testing/selftests/net/txtimestamp.c
> index dae91eb97d69..5d9d2773b076 100644
> --- i/tools/testing/selftests/net/txtimestamp.c
> +++ w/tools/testing/selftests/net/txtimestamp.c
> @@ -319,6 +319,8 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int
> payload_len)
>         for (cm = CMSG_FIRSTHDR(msg);
>              cm && cm->cmsg_len;
>              cm = CMSG_NXTHDR(msg, cm)) {
> +               printf("cm->cmsg_level %d cm->cmsg_type %d\n",
> +                      cm->cmsg_level, cm->cmsg_type);
>                 if (cm->cmsg_level == SOL_SOCKET &&
>                     cm->cmsg_type == SCM_TIMESTAMPING) {
>                         tss = (void *) CMSG_DATA(cm);
> @@ -362,7 +364,7 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int
> payload_len)
>         if (batch > 1) {
>                 fprintf(stderr, "batched %d timestamps\n", batch);
>         } else if (!batch) {
> -               fprintf(stderr, "Failed to report timestamps\n");
> +               fprintf(stderr, "Failed to report timestamps. payload_len %d\n", payload_len);
>                 test_failed = true;
>         }
>   }
> @@ -578,9 +580,12 @@ static void do_test(int family, unsigned int report_opt)
>         if (cfg_loop_nodata)
>                 sock_opt |= SOF_TIMESTAMPING_OPT_TSONLY;
>
> +       (void)sock_opt;
> +/*
>         if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
>                        (char *) &sock_opt, sizeof(sock_opt)))
>                 error(1, 0, "setsockopt timestamping");
> +*/
>
>         for (i = 0; i < cfg_num_pkts; i++) {
>                 memset(&msg, 0, sizeof(msg));
> >
> >> SOF_TIMESTAMPING_SOFTWARE is only used in traditional SO_TIMESTAMPING
> >> features including cmsg mode. But it will not be used in bpf mode.
> >
> > For simplicity, the two uses of the API are best kept identical. If
> > there is a technical reason why BPF has to diverge from established
> > behavior, this needs to be explicitly called out in the commit
> > message.
>
> SOF_TIMESTAMPING_OPT_TSONLY will not be supported. The orig_skb can always be
> passed directly to the bpf if needed without extra cost. The same probably goes
> for SOF_TIMESTAMPING_OPT_PKTINFO.

Right, they will not be supported.

> SOF_TIMESTAMPING_SOFTWARE does not seem to be
> useful either. I think only a subset of SOF_* will be supported, probably only

I had a discussion with Willem on this point yesterday. If I
understand what Willem was thinking correctly, he doesn't expect
users' behaviors to change too much.

As I said previously, I have no strong preference. Whether keeping
this report flag or not doesn't affect the core logic for BPF
extension.

> the TX_* and RX_* ones.
>
> >
> > Also, if you want to extend the API for BPF in the future, good to
> > call this out now and ideally extensions will apply to both, to
> > maintain a uniform API.
> >
> >
> > * On extra measurement points, at sendmsg or tcp_write_xmit:
> >
> > The first is interesting. For application timestamping, this was
> > never needed, as the application can just call clock_gettime before
> > sendmsg.
> >
> > In general, additional measurement points are not only useful if the
> > interval between is not constant. So far, we have seen no need for
> > any additional points.
> >
> >
> > * On skb state:
> >
> >>> For now, is there thing we can explore to share in the skb_shared_info?
> >
> > skb_shinfo space is at a premium. I don't think we can justify two
> > extra fields just for this use case.
> >
> >> My initial thought is just to reuse these fields in skb. It can work
> >> without interfering one another.
> >
> > I'm skeptical that two methods can work at the same time. If they are
> > started at different times, their sk_tskey will be different, for one.
>
> For the skb's tx_flags, Jason seems to be able to figure out by only using the
> new sk_tsflags_bpf. In the worst case, it seems there is still one bit left in
> tx_flags.

Let me try, then we'll see if it works.

>
> I am also not very positive on the skb's tskey for now.

For TCP, the final output of tskey that is reflected to users is the
result of this calculation "shinfo->tskey - $KEY". $KEY is the base
which could be either sk->sk_tskey or sk->sk_tskey_bpf. They are
initialized at different points.

You can see the calculation in __skb_complete_tx_timestamp():
serr->ee.ee_data = skb_shinfo(skb)->tskey;
serr->ee.ee_data -= atomic_read(&sk->sk_tskey);

With that said, we will keep two different $KEY to let each feature
(bpf SO_TIMESTAMPING or application SO_TIMESTAMPING) work
respectively, which also means, we probably will see two different
tskeys when two methods work parallely. It's fine because as long as
we can make sure the final tskeys are consistent in each feature.
tskey is used to identify which sendmsg() the skb should belong to.

It also works for UDP proto.

>
> Willem, I recalled I had tried to reuse the tx_flags and hwtstamp when keeping
> the delivery time in skb->tstamp for a skb redirecting from egress to ingress. I
> think that approach was stalled because the tx_flags could be changed by the
> netdevice like "skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS". How about the
> skb_shinfo(skb)->hwtstamps? At least for the TX path, it should not be changed
> until the netdevice calling skb_tstamp_tx() to report the hwtstamp? or the clone
> in the tcp stack will still break things if the hwtstamps is reused for other
> purpose?
>
> >
> > There may be workarounds. Maybe BPF can store its state in some BPF
> > specific field, indeed. Or perhaps it can store per-sk shadow state
> > that resolves the conflict. For instance, the offset between sk_tskey
> > and bpf_tskey.
>
> I have also been proposing to explore other way for the key since bpf has direct
> access to the skb (also the sk, bpf prog can store data in the sk).
>
> The bpf prog can learn what is the seq_no of the egress-ing skb. When the ack
> comes back, it can also learn the ack seq no. Does it help? It will be harder to
> use because it probably needs to store this info in the bpf map (or in the bpf
> sk storage). However, if it needs to learn the timestamp at the
> tcp_sendmsg/tcp_transmit_skb/tcp_write_xmit, this timestamp has to be stored
> somewhere also. Either in a bpf map or in a bpf sk storage.

Thanks for the idea. But please see the above comment, we could keep
the logic as simple as it is :)

>
> SEC("cgroup/setsockopt") prog can also enforce the user space setsockopt. e.g.
> it can add SOF_TIMESTAMPING_OPT_ID_TCP when user space only use
> SOF_TIMESTAMPING_OPT_ID.

Interesting.

Thanks,
Jason