Re: [PATCH net-next v2 04/12] net-timestamp: add static key to control the whole bpf extension

Jason Xing <kerneljasonxing@xxxxxxxxx> · Mon, 21 Oct 2024 11:21:59 +0800

On Mon, Oct 21, 2024 at 5:52 AM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@xxxxxxxxxxx>
> >
> > Willem suggested that we use a static key to control. The advantage
> > is that we will not affect the existing applications at all if we
> > don't load BPF program.
> >
> > In this patch, except the static key, I also add one logic that is
> > used to test if the socket has enabled its tsflags in order to
> > support bpf logic to allow both cases to happen at the same time.
> > Or else, the skb carring related timestamp flag doesn't know which
> > way of printing is desirable.
> >
> > One thing important is this patch allows print from both applications
> > and bpf program at the same time. Now we have three kinds of print:
> > 1) only BPF program prints
> > 2) only application program prints
> > 3) both can print without side effect
> >
> > Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx>
>
> Getting back to this thread. It is long, instead of responding to
> multiple messages, let me combine them in a single response.

Thank you so much!

>
>
> * On future extensions:
>
> +1 that the UDP case, and datagrams more broadly, must have a clear
> development path, before we can merge TCP.
>
> Similarly, hardware timestamps need not be supported from the start,
> but must clearly be supportable.

Agreed. Using the standalone sk_tsflags_bpf and tskey_bpf and removing
the TCP bpf test logic(say, BPF_SOCK_OPS_TX_TIMESTAMPING_OPT_CB_FLAG)
could work well for both protos. Let me give it a try first.

>
>
> * On queueing packets to userspace:
>
> > > the current behavior is to just queue to the sk_error_queue as long
> > > as there is "SOF_TIMESTAMPING_TX_*" set in the skb's tx_flags and it
> > > is regardless of the sk_tsflags. "
>
> > Totally correct. SOF_TIMESTAMPING_SOFTWARE is a report flag while
> > SOF_TIMESTAMPING_TX_* are generation flags. Without former, users can
> > read the skb from the errqueue but are not able to parse the
> > timestamps

Above is what I tried to explain how the application timestamping
feature works, not what I tried to implement for the BPF extension.

>
> Before queuing a packet to userspace on the error queue, the relevant
> reporting flag is always tested. sock_recv_timestamp has:
>
>         /*
>          * generate control messages if
>          * - receive time stamping in software requested
>          * - software time stamp available and wanted
>          * - hardware time stamps available and wanted
>          */
>         if (sock_flag(sk, SOCK_RCVTSTAMP) ||
>             (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
>             (kt && tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
>             (hwtstamps->hwtstamp &&
>              (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
>                 __sock_recv_timestamp(msg, sk, skb);
>
> Otherwise applications could get error messages queued, and
> epoll/poll/select would unexpectedly behave differently.

Right. And I have no intention to use the SOF_TIMESTAMPING_SOFTWARE
flag for BPF.

>
> > SOF_TIMESTAMPING_SOFTWARE is only used in traditional SO_TIMESTAMPING
> > features including cmsg mode. But it will not be used in bpf mode.
>
> For simplicity, the two uses of the API are best kept identical. If
> there is a technical reason why BPF has to diverge from established
> behavior, this needs to be explicitly called out in the commit
> message.
>
> Also, if you want to extend the API for BPF in the future, good to
> call this out now and ideally extensions will apply to both, to
> maintain a uniform API.

As you said, I also agree on "two uses of the API are best kept identical".

>
>
> * On extra measurement points, at sendmsg or tcp_write_xmit:
>
> The first is interesting. For application timestamping, this was
> never needed, as the application can just call clock_gettime before
> sendmsg.

Yes, we could add it after we finish the current series. I'm going to
write it down on my todo list.

>
> In general, additional measurement points are not only useful if the
> interval between is not constant. So far, we have seen no need for
> any additional points.

Taking a snapshot of tcp_write_xmit() could be useful especially when
the skb is not transmitted due to nagle algorithm.

>
>
> * On skb state:
>
> > > For now, is there thing we can explore to share in the skb_shared_info?
>
> skb_shinfo space is at a premium. I don't think we can justify two
> extra fields just for this use case.
>
> > My initial thought is just to reuse these fields in skb. It can work
> > without interfering one another.
>
> I'm skeptical that two methods can work at the same time. If they are
> started at different times, their sk_tskey will be different, for one.

Right, sk_tskey is the only special one that I will take care of.
Others like tx_flags or txstamp_ack from struct tcp_skb_cb can be
reused.

>
> There may be workarounds. Maybe BPF can store its state in some BPF
> specific field, indeed. Or perhaps it can store per-sk shadow state
> that resolves the conflict. For instance, the offset between sk_tskey
> and bpf_tskey.

Things could get complicated in the future if we want to unified the
final tskey value for all the cases. Since 1) the value of
shinfo->tskey depends on skb seq and len, 2) the final tskey output is
the diff between sk_tskey and shinfo->tskey, can I add a bpf_tskey in
struct sock and related output logic for bpf without caring if it's
the same as sk_tskey.

That said, the outputs from two methods differ. Do you think it is
acceptable? It could be simpler and easier if we keep them identical.

Again, thanks for your long conclusion and every review.

Thanks,
Jason