On Thu, Feb 6, 2025 at 9:28 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > On 2/5/25 8:08 AM, Jason Xing wrote: > >>> + switch (skops->op) { > >>> + case BPF_SOCK_OPS_TS_SCHED_OPT_CB: > >>> + delay = val->sched_delay = timestamp - val->sendmsg_ns; > >>> + break; > >> > >> For a test this is fine. But just a reminder that in general a packet > >> may pass through multiple qdiscs. For instance with bonding or tunnel > >> virtual devices in the egress path. > > > > Right, I've seen this in production (two times qdisc timestamps > > because of bonding). > > > >> > >>> + case BPF_SOCK_OPS_TS_SW_OPT_CB: > >>> + prior_ts = val->sched_delay + val->sendmsg_ns; > >>> + delay = val->sw_snd_delay = timestamp - prior_ts; > >>> + break; > >>> + case BPF_SOCK_OPS_TS_ACK_OPT_CB: > >>> + prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns; > >>> + delay = val->ack_delay = timestamp - prior_ts; > >>> + break; > >> > >> Similar to the above: fine for a test, but in practice be aware that > >> packets may be resent, in which case an ACK might precede a repeat > >> SCHED and SND. And erroneous or malicious peers may also just never > >> send an ACK. So this can never be relied on in production settings, > >> e.g., as the only signal to clear an entry from a map (as done in the > >> branch below). > > All good points. I think all these notes should be added as comment to the test. Got it, I will add them in the commit message. > I think as a test, this will be a good start and can use some followup to > address the cases. Good idea. > > > > > Agreed. In production, actually what we do is print all the timestamps > > and let an agent parse them. > > The BPF program that runs in the kernel can provide its own user interface that > best fits its environment. If a raw printing interface is sufficient, that works > well and is simple on the BPF program side. If the production system cannot > afford the raw printing cost, the bpf prog can perform some aggregation first. > > The BPF program should be able to detect when an outgoing skb is re-transmitted > and act accordingly. There is BPF timer to retire entries for which no ACK has > been received. Oh, first time to know the BPF timer. > > Potentially, this data can be aggregated into the individual bpf_sk_storage or > using a BPF map keyed by a particular IP address prefix. > > I just want to highlight here for people in the future referencing this thread > to look for implementation ideas. Thanks, I think they are useful! I will copy more description in the commit message. Thanks, Jason