Re: [PATCH net-next 0/9] net-timestamp: bpf extension to equip applications transparently

Vadim Fedorenko <vadim.fedorenko@xxxxxxxxx> · Wed, 9 Oct 2024 15:59:44 +0100

On 09/10/2024 15:35, Jason Xing wrote:
On Wed, Oct 9, 2024 at 9:58 PM Vadim Fedorenko
<vadim.fedorenko@xxxxxxxxx> wrote:

On 09/10/2024 14:47, Jason Xing wrote:
On Wed, Oct 9, 2024 at 9:16 PM Vadim Fedorenko
<vadim.fedorenko@xxxxxxxxx> wrote:

On 09/10/2024 12:48, Jason Xing wrote:
On Wed, Oct 9, 2024 at 7:12 PM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:

On Wed, Oct 9, 2024 at 5:28 PM Vadim Fedorenko
<vadim.fedorenko@xxxxxxxxx> wrote:

On 09/10/2024 02:05, Jason Xing wrote:
On Wed, Oct 9, 2024 at 7:22 AM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:

On Wed, Oct 9, 2024 at 2:44 AM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:

Jason Xing wrote:
From: Jason Xing <kernelxing@xxxxxxxxxxx>

A few weeks ago, I planned to extend the SO_TIMESTAMPING feature by using
a tracepoint to print information (say, tstamp) so that we can
transparently equip applications with this feature and require no
modification on the user side.

Later, we discussed at netconf and agreed that we can use bpf for better
extension, which is mainly suggested by John Fastabend and Willem de
Bruijn. Many thanks here! So I post this series to see if we have a
better solution to extend.

This approach relies on the existing SO_TIMESTAMPING feature. For the tx
path, users only need to pass certain flags through a bpf program to make
sure the last skb from each sendmsg() carries the timestamp-related control
flags. For the rx path, we have to use bpf_setsockopt() to set
sk->sk_tsflags and wait for the moment when recvmsg() is called.
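
For contrast, this is roughly what an application has to do today to opt in
through the existing user API, which is the per-application change the series
wants to avoid. It is a minimal sketch: the helper name
enable_sw_tx_timestamping() is made up for illustration, and the flag choice
(software tx timestamps only) is an assumption, not something taken from the
patches.

```c
#include <linux/net_tstamp.h>	/* SOF_TIMESTAMPING_* flag bits */
#include <sys/socket.h>		/* setsockopt(), SOL_SOCKET, SO_TIMESTAMPING */

/*
 * Hypothetical helper: request software tx timestamps on an existing socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt()).
 */
static int enable_sw_tx_timestamping(int fd)
{
	int flags = SOF_TIMESTAMPING_TX_SOFTWARE |	/* generate a timestamp at tx time */
		    SOF_TIMESTAMPING_SOFTWARE;		/* report software timestamps */

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}
```

The application would then also have to read the timestamps back from the
socket error queue, which is the part a bpf program could take over.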

As you mention, overall I am very supportive of having a way to add
timestamping by administrators, without having to rebuild applications.
BPF hooks seem to be the right place for this.

There is existing kprobe/kretprobe/kfunc support. Supporting
SO_TIMESTAMPING directly may be useful due to its targeted feature
set, and correlation between measurements for the same data in the
stream.

After this series, we could step by step implement more advanced
functions/flags already in SO_TIMESTAMPING feature for bpf extension.

My main implementation concern is where this API overlaps with the
existing user API, and how they might conflict. A few questions in the
patches.

Agreed. That's also what I'm concerned about. So I decided to ask for
related experts' help.

How to deal with it without interfering with the existing apps in the
right way is the key problem.

What I am trying to implement is to let the bpf program have the highest
precedence. It's similar to RTO min; see this commit as an example:

commit f086edef71be7174a16c1ed67ac65a085cda28b1
Author: Kevin Yang <yyd@xxxxxxxxxx>
Date:   Mon Jun 3 21:30:54 2024 +0000

        tcp: add sysctl_tcp_rto_min_us

        Adding a sysctl knob to allow user to specify a default
        rto_min at socket init time, other than using the hard
        coded 200ms default rto_min.

        Note that the rto_min route option has the highest precedence
        for configuring this setting, followed by the TCP_BPF_RTO_MIN
        socket option, followed by the tcp_rto_min_us sysctl.

It covers three cases: 1) route option, 2) bpf option, 3) sysctl.
The highest-priority one overrides the others. There is no good
point at which to restore the icsk_rto_min field if users want to
shut down the bpf program, because it is set in
bpf_sol_tcp_setsockopt().
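
The precedence order quoted from the commit message can be modelled as below.
This is only a userspace sketch of the ordering (route option, then
TCP_BPF_RTO_MIN, then sysctl). The struct and function names are hypothetical
and this is not how the kernel stores or resolves these values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: each source may or may not have supplied a value. */
struct rto_min_sources {
	bool		has_route;	/* rto_min route option present? */
	uint32_t	route_us;
	bool		has_bpf;	/* TCP_BPF_RTO_MIN set by a bpf program? */
	uint32_t	bpf_us;
	uint32_t	sysctl_us;	/* tcp_rto_min_us default, always present */
};

/* Pick the value from the highest-precedence source that is present. */
static uint32_t effective_rto_min_us(const struct rto_min_sources *s)
{
	if (s->has_route)
		return s->route_us;
	if (s->has_bpf)
		return s->bpf_us;
	return s->sysctl_us;
}
```

Keeping each source's value separately, as this model does, is what would make
it possible to fall back cleanly when the bpf program goes away.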

The rto_min example is slightly different. With tcp_rto_min the app doesn't
expect any data to come back to user space, while for timestamping the
app may be confused either by receiving more data than it asked for or by
not receiving data it expected. I believe some hint about the requestor of
the data is needed here. It will also help to solve the problem of
populating sk_err_queue mentioned by Martin.

Sorry, I don't fully get it. In this patch series, this bpf extension
feature will not rely on sk_err_queue any more to report tx timestamps
to userspace. Bpf program can do that printing.

Do you mean that it could be wrong if one skb carries the tsflags that
are previously set due to the bpf program and then suddenly users
detach the program? It indeed will put a new/cloned skb into the error
queue. Interesting corner case. It seems I have to re-implement a
totally independent tsflags for bpf extension feature. Do you have a
better idea on this?

I feel that if I could introduce new bpf flags like
SOF_TIMESTAMPING_TX_ACK_BPF for the last skb based on this patch
series, then it will not populate the skb in sk_err_queue even if users
remove the bpf program all of a sudden. With this kind of bpf-specific
flag, we can also avoid conflicting with apps using the
SO_TIMESTAMPING feature. Let me give it a shot unless a better
solution shows up.

It doesn't look great to have duplicate flags just to indicate that this
particular timestamp was asked by a bpf program, even though it looks

Or introduce a new field in struct sock or struct sk_buff so that
existing SOF_TIMESTAMPING_* can be reused.

Well, I was thinking about this way. We can potentially add an array of
tsflags, where the index into the array is the requestor. That will be
more flexible in terms of adding new requestors (like the scheduler or a
congestion control algo) if needed. But it comes with increased memory
usage on the hot path, which might be a blocker.

Is the following code snippet what you expect? But I wonder why not
just add a u32 field instead and then use each bit of it defined in
include/uapi/linux/net_tstamp.h?

diff --git a/include/net/sock.h b/include/net/sock.h
index b32f1424ecc5..4677f53da75a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -445,6 +445,7 @@ struct sock {
         u32                     sk_reserved_mem;
         int                     sk_forward_alloc;
         u32                     sk_tsflags;
+       u32                     new_tsflags[10];
         __cacheline_group_end(sock_write_rxtx);

         __cacheline_group_begin(sock_write_tx);

I could be missing something. Sorry. If possible, could you show me
some code snippets?

As for the new requestor, IIUC, do you want to add more tx timestamp
generating points in the future?

It's more like this:

diff --git a/include/net/sock.h b/include/net/sock.h
index c58ca8dd561b..93f931dcc4cc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -234,6 +234,14 @@ struct sock_common {
 struct bpf_local_storage;
 struct sk_filter;

+enum {
+       SOCKETOPT_TS_REQUESTOR = 0,
+       CMSG_TS_REQUESTOR,
+       BPFPROG_TS_REQUESTOR,
+
+       __MAX_TS_REQUESTOR,
+};
+
 /**
   *    struct sock - network layer representation of sockets
   *    @__sk_common: shared layout with inet_timewait_sock
@@ -444,7 +452,7 @@ struct sock {
        socket_lock_t           sk_lock;
        u32                     sk_reserved_mem;
        int                     sk_forward_alloc;
-       u32                     sk_tsflags;
+       u32                     sk_tsflags[__MAX_TS_REQUESTOR];
        __cacheline_group_end(sock_write_rxtx);

        __cacheline_group_begin(sock_write_tx);


And use existing SOF_TIMESTAMPING_* for each element in the array. Not
sure that struct sock is the best place though, as some timestamping
requests may be on per-packet basis for protocols other than TCP.

Again, I'm just thinking out loud, kinda wild idea.
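
To make the per-requestor idea a bit more concrete, here is a small userspace
model of how such an array could be queried. It is only a sketch under
assumptions: struct demo_sock, the demo_* helpers and the DEMO_* flag macros
are hypothetical stand-ins (the macros stand in for the SOF_TIMESTAMPING_*
bits), and MAX_TS_REQUESTOR replaces __MAX_TS_REQUESTOR from the diff because
double-underscore names are reserved in userspace C.

```c
#include <stdbool.h>
#include <stdint.h>

enum ts_requestor {
	SOCKETOPT_TS_REQUESTOR = 0,
	CMSG_TS_REQUESTOR,
	BPFPROG_TS_REQUESTOR,

	MAX_TS_REQUESTOR,
};

/* Stand-ins for SOF_TIMESTAMPING_* bits, defined locally for the sketch. */
#define DEMO_TX_SOFTWARE	(1u << 1)
#define DEMO_TX_ACK		(1u << 9)

/* Hypothetical stand-in for the relevant part of struct sock. */
struct demo_sock {
	uint32_t tsflags[MAX_TS_REQUESTOR];
};

/* Union of everything any requestor asked for: "should we timestamp at all?" */
static uint32_t demo_tsflags_any(const struct demo_sock *sk)
{
	uint32_t all = 0;

	for (int i = 0; i < MAX_TS_REQUESTOR; i++)
		all |= sk->tsflags[i];
	return all;
}

/* Did this particular requestor ask for this flag? */
static bool demo_requested_by(const struct demo_sock *sk,
			      enum ts_requestor who, uint32_t flag)
{
	return (sk->tsflags[who] & flag) != 0;
}

/* Drop one requestor (e.g. on bpf program detach) without touching the rest. */
static void demo_clear_requestor(struct demo_sock *sk, enum ts_requestor who)
{
	sk->tsflags[who] = 0;
}
```

With this layout, a report path could check demo_requested_by() for
SOCKETOPT_TS_REQUESTOR or CMSG_TS_REQUESTOR before queuing anything to the
error queue, so a flag set only by the bpf requestor never surprises the
application. The cost is the extra words per socket mentioned above.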