On Fri, Feb 7, 2025 at 12:22 AM Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote: > > Jason Xing wrote: > > On Thu, Feb 6, 2025 at 11:00 AM Willem de Bruijn > > <willemdebruijn.kernel@xxxxxxxxx> wrote: > > > > > > Jason Xing wrote: > > > > On Thu, Feb 6, 2025 at 5:02 AM Willem de Bruijn > > > > <willemdebruijn.kernel@xxxxxxxxx> wrote: > > > > > > > > > > Jason Xing wrote: > > > > > > On Wed, Feb 5, 2025 at 11:20 PM Willem de Bruijn > > > > > > <willemdebruijn.kernel@xxxxxxxxx> wrote: > > > > > > > > > > > > > > Jason Xing wrote: > > > > > > > > On Wed, Feb 5, 2025 at 2:09 AM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > > > > On Wed, Feb 5, 2025 at 1:08 AM Willem de Bruijn > > > > > > > > > <willemdebruijn.kernel@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > > > > > > Jason Xing wrote: > > > > > > > > > > > On Tue, Feb 4, 2025 at 9:16 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 1/28/25 12:46 AM, Jason Xing wrote: > > > > > > > > > > > > > Introduce the callback to correlate tcp_sendmsg timestamp with other > > > > > > > > > > > > > points, like SND/SW/ACK. We can let bpf trace the beginning of > > > > > > > > > > > > > tcp_sendmsg_locked() and fetch the socket addr, so that in > > > > > > > > > > > > > > > > > > > > > > > > Instead of "fetch the socket addr...", should be "store the sendmsg timestamp at > > > > > > > > > > > > the bpf_sk_storage ...". > > > > > > > > > > > > > > > > > > > > > > I will revise it. Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > tcp_tx_timestamp() we can correlate the tskey with the socket addr. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It is accurate since they are under the protect of socket lock. > > > > > > > > > > > > > More details can be found in the selftest. > > > > > > > > > > > > > > > > > > > > > > > > The selftest uses the bpf_sk_storage to store the sendmsg timestamp at > > > > > > > > > > > > fentry/tcp_sendmsg_locked and retrieves it back at tcp_tx_timestamp (i.e. > > > > > > > > > > > > BPF_SOCK_OPS_TS_SND_CB added in this patch). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jason Xing <kerneljasonxing@xxxxxxxxx> > > > > > > > > > > > > > --- > > > > > > > > > > > > > include/uapi/linux/bpf.h | 7 +++++++ > > > > > > > > > > > > > net/ipv4/tcp.c | 1 + > > > > > > > > > > > > > tools/include/uapi/linux/bpf.h | 7 +++++++ > > > > > > > > > > > > > 3 files changed, 15 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > > > > > > > > > > index 800122a8abe5..accb3b314fff 100644 > > > > > > > > > > > > > --- a/include/uapi/linux/bpf.h > > > > > > > > > > > > > +++ b/include/uapi/linux/bpf.h > > > > > > > > > > > > > @@ -7052,6 +7052,13 @@ enum { > > > > > > > > > > > > > * when SK_BPF_CB_TX_TIMESTAMPING > > > > > > > > > > > > > * feature is on. > > > > > > > > > > > > > */ > > > > > > > > > > > > > + BPF_SOCK_OPS_TS_SND_CB, /* Called when every sendmsg syscall > > > > > > > > > > > > > + * is triggered. For TCP, it stays > > > > > > > > > > > > > + * in the last send process to > > > > > > > > > > > > > + * correlate with tcp_sendmsg timestamp > > > > > > > > > > > > > + * with other timestamping callbacks, > > > > > > > > > > > > > + * like SND/SW/ACK. > > > > > > > > > > > > > > > > > > > > > > > > Do you have a chance to look at how this will work at UDP? > > > > > > > > > > > > > > > > > > > > > > Sure, I feel like it could not be useful for UDP. Well, things get > > > > > > > > > > > strange because I did write a long paragraph about this thing which > > > > > > > > > > > apparently disappeared... > > > > > > > > > > > > > > > > > > > > > > I manage to find what I wrote: > > > > > > > > > > > For UDP type, BPF_SOCK_OPS_TS_SND_CB may be not suitable because > > > > > > > > > > > there are two sending process, 1) lockless path, 2) lock path, which > > > > > > > > > > > should be handled carefully later. For the former, even though it's > > > > > > > > > > > unlikely multiple threads access the socket to call sendmsg at the > > > > > > > > > > > same time, I think we'd better not correlate it like what we do to the > > > > > > > > > > > TCP case because of the lack of sock lock protection. Considering SND_CB is > > > > > > > > > > > uapi flag, I think we don't need to forcely add the 'TCP_' prefix in > > > > > > > > > > > case we need to use it someday. > > > > > > > > > > > > > > > > > > > > > > And one more thing is I'd like to use the v5[1] method in the next round > > > > > > > > > > > to introduce a new tskey_bpf which is good for UDP type. The new field > > > > > > > > > > > will not conflict with the tskey in shared info which is generated > > > > > > > > > > > by sk->sk_tskey in __ip_append_data(). It hardly works if both features > > > > > > > > > > > (so_timestamping and its bpf extension) exists at the same time. Users > > > > > > > > > > > could get confused because sometimes they fetch the tskey from skb, > > > > > > > > > > > sometimes they don't, especially when we have cmsg feature to turn it on/ > > > > > > > > > > > off per sendmsg. A standalone tskey for bpf extension will be needed. > > > > > > > > > > > With this tskey_bpf, we can easily correlate the timestamp in sendmsg > > > > > > > > > > > syscall with other tx points(SND/SW/ACK...). > > > > > > > > > > > > > > > > > > > > > > [1]: https://lore.kernel.org/all/20250112113748.73504-14-kerneljasonxing@xxxxxxxxx/ > > > > > > > > > > > > > > > > > > > > > > If possible, we can leave this question until the UDP support series > > > > > > > > > > > shows up. I will figure out a better solution :) > > > > > > > > > > > > > > > > > > > > > > In conclusion, it probably won't be used by the UDP type. It's uAPI > > > > > > > > > > > flag so I consider the compatibility reason. > > > > > > > > > > > > > > > > > > > > I don't think this is acceptable. We should aim for an API that can > > > > > > > > > > easily be used across protocols, like SO_TIMESTAMPING. > > > > > > > > > > > > > > > > > > After I revisit the UDP SO_TIMESTAMPING again, my thoughts are > > > > > > > > > adjusted like below: > > > > > > > > > > > > > > > > > > It's hard to provide an absolutely uniform interface or usage to users > > > > > > > > > for TCP and UDP and even more protocols. Cases can be handled one by > > > > > > > > > one. > > > > > > > > > > > > > > We should try hard. SO_TIMESTAMPING is uniform across protocols. > > > > > > > An interface that is not is just hard to use. > > > > > > > > > > > > > > > > The main obstacle is how we can correlate the timestamp in > > > > > > > > > sendmsg syscall with other sending timestamps. It's worth noticing > > > > > > > > > that for SO_TIMESTAMPING the sendmsg timestamp is collected in the > > > > > > > > > userspace. For instance, while skb enters the qdisc, we fail to know > > > > > > > > > which skb belongs to which sendmsg. > > > > > > > > > > > > > > > > > > An idea coming up is to introduce BPF_SOCK_OPS_TS_SND_CB to correlate > > > > > > > > > the sendmsg timestamp with tskey (in tcp_tx_timestamp()) under the > > > > > > > > > protection of socket lock + syscall as the current patch does. But for > > > > > > > > > UDP, it can be lockless. IIUC, there is a very special case where even > > > > > > > > > SO_TIMESTAMPING may get lost: if multiple threads accessing the same > > > > > > > > > socket send UDP packets in parallel, then users could be confused > > > > > > > > > which tskey matches which sendmsg. > > > > > > > > > > > > > > This is a known issue for lockless datagram sockets. > > > > > > > > > > > > > > With SO_TIMESTAMPING, but the use of timestamping and of concurrent > > > > > > > sendmsg calls is under control of the process, so it only shoots > > > > > > > itself in the foot. > > > > > > > > > > > > > > With BPF timestamping, a process may confuse a third party admin, so > > > > > > > the situation is slightly different. > > > > > > > > > > > > Agreed. > > > > > > > > > > > > > > > > > > > > > > IIUC, I will not consider this > > > > > > > > > unlikely case, then the UDP case is quite similar to the TCP case. > > > > > > > > > > > > > > > > > > The scenario for the UDP case is: > > > > > > > > > 1) using fentry bpf to hook the udp_sendmsg() to get the timestamp > > > > > > > > > like TCP does in this series. > > > > > > > > > 2) insert BPF_SOCK_OPS_TS_SND_CB into __ip_append_data() near the > > > > > > > > > SO_TIMESTAMPING code snippets to let bpf program correlate the tskey > > > > > > > > > with timestamp. > > > > > > > > > Note: tskey in UDP will be handled carefully in a different way > > > > > > > > > because we should support both modes for socket timestamping at the > > > > > > > > > same time. > > > > > > > > > It's really similar to TCP regardless of handling tskey. > > > > > > > > > > > > > > > > > > > > > > > > > To be more precise in case you don't have much time to read the above > > > > > > > > long paragraph, BPF_SOCK_OPS_TS_SND_CB is mainly used to correlate > > > > > > > > sendmsg timestamp with corresponding tskey. > > > > > > > > > > > > > > > > 1. For TCP, we can correlate it in tcp_tx_timestamp() like this patch does. > > > > > > > > > > > > > > > > 2. For UDP, we can correlate in __ip_append_data() along with those > > > > > > > > tskey initialization, assuming there are no multiple threads calling > > > > > > > > locklessly ip_make_skb(). Locked path > > > > > > > > (udp_sendmsg()->ip_append_data()) works like TCP under the socket lock > > > > > > > > protection, so it can be easily handled. Lockless path > > > > > > > > (udp_sendmsg()->ip_make_skb()) can be visited by multiple threads at > > > > > > > > the same time, which should be handled properly. > > > > > > > > > > > > > > Different hook points is fine, as UDP (and RAW) uses __ip_append_data > > > > > > > > > > > > Then this approach (introducing this new flag) is feasible. Sorry that > > > > > > last night I wrote such a long paragraph which buried something > > > > > > important. Because of that, I rephrase the whole idea about how to let > > > > > > UDP work with this kind of new flag in [patch v8 11/12]. Link is > > > > > > https://lore.kernel.org/all/CAL+tcoCmXcDot-855XYU7PKCiGvJL=O3CQBGuOTRAs2_=Ys=gg@xxxxxxxxxxxxxx/ > > > > > > > > > > > > > or more importantly ip_send_skb, while TCP uses ip_queue_xmit. > > > > > > > > > > > > For TCP, we use tcp_tx_timestamp to finish the map between sendmsg > > > > > > timestamp and tskey. > > > > > > > > > > > > > > > > > > > > As long as the API is the same: the operation (BPF_SOCK_OPS_TS_SND_CB) > > > > > > > and the behavior of that operation. Subject to the usual distinction > > > > > > > between protocol behavior (bytestream vs datagram). > > > > > > > > > > > > I see your point. > > > > > > > > > > > > > > > > > > > > > I prefer to implement > > > > > > > > the bpf extension for IPCORK_TS_OPT_ID, which should be another topic, > > > > > > > > I think. This might be the only one corner case, IIUC? > > > > > > > > > > > > > > This sounds like an entirely different topic? Not sure what this is. > > > > > > > > > > > > Not really a different topic. I mean let bpf prog take the whole > > > > > > control of setting the tskey, then with this BPF_SOCK_OPS_TS_SND_CB > > > > > > flag we can correlate the sendmsg timestamp with tskey. So It has > > > > > > something to do with the usage of UDP. Please take a look at that link > > > > > > to patch 11/12. For TCP, we don't need to care about the value of > > > > > > tskey which has already been taken care of by SO_TIMESTAMPING. So it > > > > > > is slightly different. I'm not sure if this kind of usage is > > > > > > acceptable? > > > > > > > > > > Why can TCP rely on SO_TIMESTAMPING to set tskey, but UDP cannot? > > > > > > > > Because for TCP the shared info tskey is calculated by seqno (in > > > > tcp_tx_timestamp()), so it works for so_timestamping and its bpf > > > > extension and they are the same. However, for UDP, the shared info > > > > tskey can be different, depending on when to call __ip_append_data() > > > > and what the sk->sk_tskey is. It can cause conflicts when two modes > > > > work at the same time. > > > > > > lockless and locked cannot conflict. (if up->pending then the only > > > option is to append to that.) > > > > Sorry, I should have described more about this point. I was trying to > > say that if two modes (bpf extension and application timestamping) > > work at the same time, will the tskey get messed up? Because we have > > to check if the application mode is turned on. If on, we fetch the > > existing key generated by the application mode, or else we generate > > one by modifying the sk->sk_tskey or the tskey of skb? > > This applies to all protocols that implement both. TCP, UDP, > eventually RAW and maybe others like L2TP, CAN, MPTCP, .. > > Which is why it should be handled the same uniformly. Got it. > > > > > > > > More than that, lockless UDP case is a tough > > > > one since we cannot correlate the sendmsg timestamp in udp_sendmsg() > > > > with the tskey generated in __ip_append_data(), > > > > > > With SO_TIMESTAMPING we do not have this distinction between TCP and > > > UDP, so we don't need it here. > > > > > > It is true that multiple lockless sendmsg calls can race and in that > > > case that correlation is ambiguous. That is also the case for > > > SO_TIMESTAMPING and a known issue. > > > > > > This is unlikely in most workloads in practice. > > > > Oh, I see. > > > > > > which is a long gap > > > > without any lock. So reuse the IPCORK_TS_OPT_ID logic for bpf > > > > extension here can work. > > > > > > It is fine to add a solution to work around the ambiguity. But not > > > to make it a precondition and so diverge the API for TCP and UDP. > > > > There is no divergence in API. BPF always uses SND flag to finish the > > correlation. The only difference is UDP needs to manage the tskey pool > > (allocating which tskey to which sendmsg). > > Again, I don't see how this is UDP specific. > > > > > > > > > The same argument to choose a key from BPF can be made for TCP to > > > a certain extent. > > > > Well...right, but we don't bother to do this for TCP. The TCP case is simpler. > > > > It seems that you object to the idea that let bpf prog control the > > allocating tskey for UDP _as default_. > > Correct. All protocols should have the same sensible default behavior. I see. > > > I'm not sure if I follow you. Sorry for repeating in case that I miss something: > > 1) For UDP, do not allocate the tskey from the bpf program, unless > > it's an alternative/workaround to handle the lockless case. So it's a > > backup choice. > > And an alterative API should apply equally to all protocols too. It seems feasible. > > > 2) Use the exact same way like TCP to finish the correlation on the > > basis of socket lock protection _as default_. Because the lockless > > method is seldomly used then we may provide 1) method? > > ? > > The opposite: the default for UDP is the lockless fast path. Then using SND flag to correlate for UDP is safe (by putting the SND flag in the __ip_append_data() is suitable for at least UDP case). Now I have a clear vision and goal after so many rounds of discussion. Thanks. Thanks, Jason