On Sun, May 26, 2019 at 8:30 PM Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@xxxxxxxxxxx> wrote:
> >
> > > On May 25, 2019, at 8:20 AM, Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote:
> > >
> > > On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@xxxxxxxxxxx> wrote:
> > >>
> > >>> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote:
> > >>>
> > >>> It is the last moment that a timestamp can be generated for the last
> > >>> byte; I don't see how that is “neither the start nor the end of a GSO
> > >>> packet”.
> > >>
> > >> My misunderstanding. I thought TCP did last-segment timestamping, not
> > >> last-byte. In that case, your statements make sense.
> > >>
> > >>>> It would be interesting if a practical case can be made for
> > >>>> timestamping the last segment. In my mind, I don’t see how that
> > >>>> would be valuable.
> > >>>
> > >>> It depends whether you are interested in measuring network latency or
> > >>> host transmit path latency.
> > >>>
> > >>> For the latter, knowing the time from the start of the sendmsg call to
> > >>> the moment the last byte hits the wire is most relevant. Or, in the
> > >>> absence of (well-defined) hardware support, the last byte being queued
> > >>> to the device is the next best thing.
> > >
> > > Sounds to me like both cases have a legitimate use case, and we want
> > > to support both.
> > >
> > > Implementation constraints are that storage for this timestamp
> > > information is scarce and we cannot add new cold cacheline accesses
> > > in the datapath.
> > >
> > > The simplest approach would be to unconditionally timestamp both the
> > > first and last segment, with the same ID. Not terribly elegant, but
> > > it works.
> > >
> > > If conditional, tx_flags has only one bit left. I think we can harvest
> > > some, as not all defined bits are in use at the same stages in the
> > > datapath, but that is not a trivial change. Some might also better be
> > > set in the skb instead of skb_shinfo, which would also avoid touching
> > > that cacheline. We could possibly repurpose bits from the u32 tskey.
> > >
> > > All that can come later. Initially, unless we can come up with
> > > something more elegant, I would suggest that UDP follow the rule
> > > established by TCP and timestamp the last byte, and that we add an
> > > explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
> > > supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
> > > __sock_tx_timestamp and is interpreted in __udp_gso_segment.
> >
> > I don’t see how to practically TX timestamp the last byte of any packet
> > (UDP GSO or otherwise). The best we could do is timestamp the last
> > segment, or rather the time that the last segment is queued. Let me
> > attempt to explain.
> >
> > First, let’s look at software TX timestamps, which are generated by
> > skb_tx_timestamp() in nearly every network driver’s xmit routine. Its
> > documentation states:
> >
> > —————————— cut ————————————
> > * Ethernet MAC Drivers should call this function in their hard_xmit()
> > * function immediately before giving the sk_buff to the MAC hardware.
> > —————————— cut ————————————
> >
> > That means that the sk_buff will get timestamped just before, rather
> > than just after, it is sent. To truly capture the timestamp of the
> > last byte, this routine would have to be called a second time, right
> > after handing the sk_buff to the MAC hardware. Then the user program
> > would have to sort out the two timestamps. My guess is that this isn’t
> > something that NIC vendors would be willing to implement in their
> > drivers.
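To make that call site concrete, here is a minimal sketch of a
hypothetical driver's xmit handler; foo_xmit, foo_priv, and
foo_ring_doorbell are made-up names, not taken from any real driver:

—————————— cut ————————————
/* Illustrative only: the software tx timestamp is taken immediately
 * before the frame is handed to the hardware, so it can only record a
 * "just before the wire" time, never a "just after" one.
 */
static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct foo_priv *priv = netdev_priv(dev);

        /* ... map DMA buffers and fill tx descriptors ... */

        skb_tx_timestamp(skb);          /* software (and PHY) tx timestamp hook */

        foo_ring_doorbell(priv);        /* hardware owns the frame after this */

        return NETDEV_TX_OK;
}
—————————— cut ————————————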
> >
> > So, the best we can do is timestamp just before the last segment.
> > Suppose UDP GSO sends 3000 bytes to a 1500-byte MTU adapter. If we
> > set the SKBTX_HW_TSTAMP flag on the last segment, the timestamp
> > occurs halfway through the burst. But it may not be exactly halfway,
> > because the segments may get queued much faster than wire rate.
> > Therefore the time between segment 1 and segment 2 may be much, much
> > smaller than their spacing on the wire. I would not find this useful.
>
> For measuring host queueing latency, a timestamp at the existing
> skb_tx_timestamp() for the last segment is perfectly informative.

In most cases all segments will be sent in a single xmit_more train,
in which case the device doorbell is rung when the last segment is
queued. A device may also pause in the middle of a train, causing the
rest of the list to be requeued and resent after a tx completion frees
up descriptors and wakes the device. This seems like a relevant
exception to be able to measure.

That said, I am not opposed to the first segment, if we have to make a
binary choice for a default. Either option has cons. See the more
specific revision requests in the v2 patch.
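As a concrete illustration of the measurement under discussion, a
hedged userspace sketch (not the selftest under review): it requests
software tx timestamps, sends one UDP GSO super-packet, and reads the
timestamp back from the error queue. The 3000-byte buffer and
1400-byte gso_size are illustrative assumptions only.

—————————— cut ————————————
#include <stdio.h>
#include <time.h>
#include <netinet/in.h>
#include <linux/errqueue.h>
#include <linux/net_tstamp.h>
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103         /* since Linux 4.18 */
#endif

static void send_and_fetch_tstamp(int fd, struct sockaddr_in *dst)
{
        char buf[3000] = {0};   /* split into 1400-byte segments below */
        int gso_size = 1400;
        int flags = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE |
                    SOF_TIMESTAMPING_OPT_ID | SOF_TIMESTAMPING_OPT_TSONLY;
        char control[256];
        struct msghdr msg = {0};
        struct cmsghdr *cm;

        setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
        setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

        sendto(fd, buf, sizeof(buf), 0, (struct sockaddr *)dst, sizeof(*dst));

        /* Real code should poll(): the timestamp arrives asynchronously. */
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                return;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level == SOL_SOCKET &&
                    cm->cmsg_type == SCM_TIMESTAMPING) {
                        struct scm_timestamping *tss = (void *)CMSG_DATA(cm);

                        /* ts[0] holds the software timestamp */
                        printf("tx sw tstamp %ld.%09ld\n",
                               (long)tss->ts[0].tv_sec,
                               (long)tss->ts[0].tv_nsec);
                }
        }
}
—————————— cut ————————————

With last-segment timestamping as described above, the printed time
corresponds to the moment the final segment of the burst was queued to
the device.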