Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Thu, 31 Oct 2024 16:26:27 -0700

On 10/31/24 6:50 AM, Jason Xing wrote:
On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:

Jason Xing wrote:
On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:

On 10/30/24 5:13 PM, Jason Xing wrote:
I realized that we will have some new sock_opt flags like
TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
not... For each sock_opt point, they will be called without caring if
related flags in skb are set. Well, it's meaningless to add more
control of skb tsflags at each TS_xx_OPT_CB point.

Am I understanding in a correct way? Now, I'm totally fine with this great idea!
Yes, I think so.

The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
be quite wasteful to throw it away. ACK can be controlled by the
TCP_SKB_CB(skb)->bpf_txstamp_ack.

Right, let me try this:)

Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
comment. I think it may as well go back to use the "u8
bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
enable/disable the timestamp related callback hook. May be add one
BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.

bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
If that is so, it cannot be suitable for UDP.

I'm thinking of this solution:
1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags

probably not in include/uapi/linux/net_tstamp.h. This flag can only be used by a 
bpf prog (meaning will not be used by user space syscall). More below.

2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
in udp_sendmsg()
...

Not sure how many churns/audits is needed to ensure the user space cannot 
set/clear the SOF_TIMESTAMPING_OPT_BPF bit in sk->sk_tsflags. Could be not much.

May be it is cleaner to leave the sk->sk_tsflags for user space only and having 
a separate field in "struct sock" to track bpf specific needs. More like your 
current sk_tsflags_bpf approach but I was thinking to reuse the 
bpf_sock_ops_cb_flags instead. e.g. "BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), 
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)" is used to check if it needs to call a bpf 
prog to decide if it needs to add tcp header option. Here we want to test if it 
should call a bpf prog to make a decision on tx timestamp on a skb.

The bpf_sock_ops_cb_flags can be moved from struct tcp_sock to struct sock. It 
is doable from the bpf side.

All that said, but, yes, it will add some TCP specific enum flag (e.g. 
BPF_SOCK_OPS_RTO_CB_FLAG) to the struct sock which will not be used by 
UDP/raw/...etc, so may be keep your current sk_tsflags_bpf approach but rename 
it to sk_bpf_cb_flags in struct "sock" so that it can be reused for other non 
tstamp ops in the future? probably a u8 is enough.

This optname is used by the bpf prog only and not usable by user space syscall. 
If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf 
specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets 
the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting 
the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET, 
SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt() 
alone without calling into sk_{set,get}sockopt. Add a new enum for the optval 
for the sk_bpf_cb_flags:

enum {
	SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
	SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
};

For tx, one new hook should be at the sendmsg and should be around
tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be

I think there are two points we're supposed to record:
1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
2) another point in tcp_tx_timestamp(). It represents the timestamp of
the last skb in this sendmsg() call.
Users may happen to send a big packet.

hmm... a big packet and sendmsg is blocked waiting for memory?

Err on the side of fewer measurement points. It's always possible to
add more later, but not possible to remove them (depending on whether
BPF infra is ABI).

I also think it is better to start with tcp_tx_timestamp() alone first to keep 
the patch set simple now. The selftest prog can use a bpf fentry prog to trace 
the tcp_sendmsg_locked(). This can be revisited later if the bpf fentry prog is 
not enough.

Overall great suggestion. Thanks a lot for sharing your BPF expertise
on this, Martin.

Thanks!

On using the raw seqno: this data is accessible to anyone root in
namespace (ns_capable) using packet sockets, so as long as it does not
open to more than that, it is logically equivalent to the current
setting.

With seqno the BPF program has to be careful that the same seqno can
be retransmitted, so for instance seeing an ACK before a (second) SND
must be anticipated. That is true for SO_TIMESTAMPING today too.

Ah. It will be a very useful comment to add to the selftests bpf prog.

For datagrams (UDP as well as RAW and many non IP protocols), an
alternative still needs to be found.

In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags 
& SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) & 
SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is 
unlikely, may be we can just disallow bpf prog from directly setting 
skb_shinfo(skb)->tskey for this particular skb.

For all other cases, in __ip[6]_append_data, directly call a bpf prog and also 
pass the kernel decided tskey to the bpf prog.

The kernel passed tskey could be 0 (meaning the user space has not used it). The 
bpf prog can give one for the kernel to use. The bpf prog can store the 
sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct 
sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX 
instead) if it helps.

If the kernel passed tskey is not 0, the bpf prog can just use that one 
(assuming the user space is doing something sane, like the value in 
SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this 
is very unlikely also (?) but the bpf prog can probably detect this and choose 
to ignore this sk.

To solve the above unsupported corner cases, I think we can allow the bpf prog 
to store something in the shinfo->hwtstamps at the tx path. The bpf-only key 
could be one of the things to store there. Change __ip[6]_append_data to handle 
the shinfo->hwtstamps. I think allowing the bpf prog to write to the 
shinfo->hwtsatmps could be considered later when needed.

[ I may be off tomorrow, so reply could be slower. ]

It seems that using the tskey for bpf extension is always correct and
easy to use.

Could we provide the tskey to users and then let users decide the
better way to identify the call of sendmsg. We could keep the
traditional use of tskey. If without it, people need to figure out a
good way and may find it difficult to use the bpf extension.

I will keep thinking of alternatives for UDP in the meantime.