On 11/12/2020 00:30, Willem de Bruijn wrote:
>>> If I understand correctly, you are trying to achieve a single delivery time.
>>> The need for two separate timestamps passed along is only because the
>>> kernel is unable to do the time base conversion.
>>
>> Yes, a correct point.
>>
>>>
>>> Else, ETF could program the qdisc watchdog in system time and later,
>>> on dequeue, convert skb->tstamp to the h/w time base before
>>> passing it to the device.
>>
>> Or skb->tstamp is a hardware timestamp and ETF converts it to the system clock base.
>>
>>>
>>> It's still not entirely clear to me why the packet has to be held by
>>> ETF initially first, if it is held until delivery time by hardware
>>> later. But more on that below.
>>
>> Let's plot a simple scenario.
>> App A sends a packet with timestamp 100.
>> Then a second packet arrives from App B with timestamp 90.
>> Without ETF, the second packet would have to wait until the interface hardware sends the first packet at 100,
>> making the second packet late by 10 plus the first packet's send time.
>> Obviously other "normal" packets are sent to the non-ETF queue, so they do not block ETF packets.
>> The ETF delta is a barrier: the application has to send the packet before it to ensure the packet is not tossed.
>
> Got it. The assumption here is that devices are FIFO. That is not
> necessarily the case, but I do not know whether it is in practice,
> e.g., on the i210.
>
>>
>>>
>>> So far, the use case sounds a bit narrow and the use of two timestamp
>>> fields for a single delivery event a bit of a hack.
>>
>> The definition of a hack is up to you.
>
> Fair enough :) That wasn't very constructive feedback on my part.
>
>>> And one that does impose a cost in the hot path of many workloads
>>> by adding a field to the ip cookie and cork and writing to (possibly cold)
>>> skb_shinfo for every packet.
>>
>> Most packets do not use skb->tstamp either; the cost of testing is probably higher than just copying.
>> But perhaps if we copy two timestamps we can add a condition for both.
>> What do you think?
>
> I'd need to take a closer look at the skb_hwtstamps, which unlike
> skb->tstamp lie in skb_shared_info. If that is an otherwise cold
> cacheline, then access would be expensive.

Good point. We should review it. That could make a flag for copying the
timestamps into the SKB more feasible.

>
> The ipcm and cork are admittedly cheap and not worth a branch. But
> still it is good to understand that this situation of unsynchronized
> clocks is a common operating condition for the foreseeable future, not
> an unfortunate constraint of a single piece of hardware.
>
> An extreme option would be moving everything behind a static_branch as
> most hot paths will not have the feature enabled. But I'm not
> seriously suggesting that for a few assignments.
>
>> The cookie and the cork are just intermediates from application to SKB; I do not think they cost much.
>> Both writes of the timestamp to the cookie and the cork are conditional.
>>
>>>
>>>>>>> Indeed, we want pacing offload to work for existing applications.
>>>>>>>
>>>>>> As the conversion between the PHC and the system clock changes dynamically over time,
>>>>>> how do you propose to achieve it?
>>>>>
>>>>> Can you elaborate on this concern?
>>>>
>>>> Using a single timestamp has 3 possible solutions:
>>>>
>>>> 1. Current solution: synchronize the system clock and the PHC.
>>>> The application uses the system clock.
>>>> The ETF qdisc can use the system clock for ordering and pass the packet to the driver on time.
>>>> The network interface hardware compares the timestamp to the PHC.
>>>>
>>>> 2. The application converts the PHC timestamp to a system clock based one.
>>>> The ETF qdisc works as in solution 1.
>>>> The network driver converts the system clock timestamp back to a PHC timestamp.
>>>> This solution needs a new Net-Link flag and modifying the relevant network drivers.
>>>> Yet this solution has 2 problems:
>>>> * Applications today are not aware that the system clock and the PHC are not synchronized
>>>> and therefore do not perform any conversion; most of them only use the system clock.
>>>> * The conversion in the network driver happens ~300 - 600 microseconds after
>>>> the application sends the packet.
>>>> As the PHC and system clock frequencies and offsets can change during this period,
>>>> the conversion will produce a PHC timestamp different from the application's original timestamp.
>>>> We require a precision of 1 nanosecond for the PHC timestamp.
>>>>
>>>> 3. The application uses a PHC timestamp for skb->tstamp.
>>>> The ETF qdisc converts the PHC timestamp to a system clock timestamp.
>>>> This solution requires support for reading PHC clocks
>>>> from IRQ/kernel-thread context in kernel space.
>>>
>>> ETF has to release the packet well in advance of the hardware
>>> timestamp for the packet to arrive at the device on time. In practice
>>> I would expect this delta parameter to be at least at usec timescale.
>>> That gives some wiggle room with regard to s/w tstamp, at least.
>>
>> Yes, the author of the ETF qdisc uses a delta of 300 usec.
>> The interface I use for testing, the Intel I210, needs around 100 usec to 150 usec.
>> I believe this is related to the PCIe speed of transferring the data on time, and it is probably similar for other interfaces using PCIe.
>> If you overload the interface hardware with high traffic the delta is much higher.
>> The clock conversion error of the application is typically around 1 usec to 5 usec when sending up to 10 ms ahead.
>>
>>>
>>> If changes in clock distance are relatively infrequent, could this
>>> clock diff be a qdisc parameter, updated infrequently outside the
>>> packet path?
>>
>> As both clocks update their frequency and offset dynamically, this is very hard to do.
>> The update rate of the PHC depends on the PTP settings (usually around 1 second).
>> The update rate of the system clock depends on how you synchronize it (I assume it is similar or slower).
>> But users may require and use higher rates; the only penalty is more traffic and CPU.
>> Bear in mind that the two clock synchronizations are independent, so their cross offset can be unpredictable.
>>
>> The ETF qdisc would have to "know" for which packets to use the previous update and for which the latest one,
>> and hope we do not "miss" updates.
>>
>> And we would need a "service" to update these values with the proper configuration, which sounds like overhead to me.
>
> Ack. Thanks for the operating context. I didn't know these constraints
> well enough. Agreed that this is not a very feasible approach then.
>
>> Another point:
>> the delta includes the PCIe DMA transfer speed, which is a hardware limitation.
>> The idea of TSN is that the application sends the packet as close as possible to the hardware send time.
>> Increasing the error of the clock conversion defeats the purpose of TSN.
>>
>> A more reasonable approach is to track the clocks inside the kernel,
>> as we mentioned in solution 3.
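For anyone who wants to see these numbers themselves, here is a minimal
userspace sketch of how the PHC-to-CLOCK_TAI offset can be sampled. The
/dev/ptp0 path is only an assumption (use the PHC of the NIC in question),
error handling is mostly omitted, and FD_TO_CLOCKID follows the convention
used by the kernel's PTP test tools:

/* Sketch: estimate the PHC - CLOCK_TAI offset from userspace.
 * /dev/ptp0 is an assumed path; pick the PHC of the NIC you use.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CLOCKFD 3
#define FD_TO_CLOCKID(fd) ((~(clockid_t)(fd) << 3) | CLOCKFD)

static int64_t ts_ns(struct timespec ts)
{
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	struct timespec sys1, phc, sys2;
	int fd = open("/dev/ptp0", O_RDONLY);

	if (fd < 0)
		return 1;

	/* Bracket the PHC read with two system clock reads and take the
	 * midpoint, so the read latency cancels out to first order.
	 */
	clock_gettime(CLOCK_TAI, &sys1);
	clock_gettime(FD_TO_CLOCKID(fd), &phc);
	clock_gettime(CLOCK_TAI, &sys2);

	printf("PHC - CLOCK_TAI: %lld ns\n",
	       (long long)(ts_ns(phc) - (ts_ns(sys1) + ts_ns(sys2)) / 2));
	close(fd);
	return 0;
}

Tools such as phc2sys do essentially this sampling continuously, with
filtering and a servo. That is roughly the machinery a "clock diff as a
qdisc parameter" approach would need to keep fresh, which is why it
degrades as soon as either clock is stepped or its frequency adjusted.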
>>
>>>
>>> It would even be preferable if the qdisc and core stack could be
>>> ignorant of such hardware clocks and the time base is converted by the
>>> device driver when encoding skb->tstamp into the tx descriptor. Is the
>>> device hardware clock readable by the driver?
>>
>> All drivers that support PTP (IEEE 1588) have to read the PHC.
>> PTP is mandatory for TSN.
>> But some drivers might be limited in which context they can read the PHC.
>> This is a question for the vendors.
>> For example, the Intel I210 allows reading the PHC.
>>
>> However, the kernel POSIX clock layer uses locking that assumes application context.
>>
>>>
>>> From the above, it sounds like this is not trivial.
>>
>> I am not sure it is so complicated.
>> But the Linux maintainers want to keep the Linux kernel with a single system clock,
>> so it might be more of a political question, or perhaps need better planning than I did.
>
> This would seem the preferable option to me: use a kernel time base
> throughout the stack and limit knowledge of the hardware clock to the
> relevant hardware driver.
>
> If that is infeasible, then I don't immediately see an alternative to
> the current dual timestamp field variant, either.
>
>>>
>>> I don't know which exact device you're targeting. Is it an in-tree driver?
>>
>> ETF is used with ethernet interfaces that support IEEE 1588 and 802.1Qbv or 802.1Qbu,
>> i.e. interfaces that support TSN (https://en.wikipedia.org/wiki/Time-Sensitive_Networking).
>> I use the Intel I210 at the moment.
>> As of 5.10-rc6, there are 4 such drivers:
>> kernel-etf/drivers/net/ethernet (etf-5.10-rc6)$ find -name '*.c' | xargs grep -r TC_SETUP_QDISC_ETF
>> ./freescale/enetc/enetc.c: case TC_SETUP_QDISC_ETF:
>> ./stmicro/stmmac/stmmac_main.c: case TC_SETUP_QDISC_ETF:
>> ./intel/igc/igc_main.c: case TC_SETUP_QDISC_ETF:
>> ./intel/igb/igb_main.c: case TC_SETUP_QDISC_ETF:
>> There are more vendors, like Renesas, which has a driver for the RZ-G SoC.
>> Broadcom has chips that support TSN, but they do not publish the code.
>> I believe that other vendors will add TSN support as it becomes more popular.
>
> Very clear. Thanks.
>
>>>
>>>> Just for clarification:
>>>> ETF, as all Net-Link, only uses the system clock (TAI).
>>>> The network interface hardware only uses the PHC.
>>>> Neither Net-Link nor the driver performs any conversion.
>>>> The kernel does not provide any clock conversion besides the system clock;
>>>> the Linux kernel is a single-clock system.
>>>>
>>>>>
>>>>> The simplest solution for offloading pacing would be to interpret
>>>>> skb->tstamp either for software pacing, or skip software pacing if the
>>>>> device advertises a NETIF_F hardware pacing feature.
>>>>
>>>> That would defy the purpose of ETF.
>>>> ETF exists for ordering packets.
>>>> Why should the device driver defer it?
>>>> Simply do not use the qdisc for this interface.
>>>
>>> ETF queues packets until their delivery time is reached. It does not
>>> order for any other reason than to calculate the next qdisc watchdog
>>> event, really.
>>
>> No, ETF also orders the packets in .enqueue = etf_enqueue_timesortedlist().
>> Our patch suggests ordering them by the hardware timestamp
>> and leaving the watchdog set from skb->tstamp, which holds the system clock (TAI) timestamp.
>>
>>>
>>> If h/w can do the same and the driver can convert skb->tstamp to the
>>> right timebase, indeed no qdisc is needed for pacing. But there may be
>>> a need for selective h/w offload if h/w has additional constraints,
>>> such as on the number of packets outstanding or time horizon.
>>
>> The driver does not order the packets; it sends packets in the order of arrival.
>> We could add an ETF component to each relevant driver, but do we want to add Net-Link features to drivers?
>> I think the purpose is to keep the drivers as small as possible and leave the common intelligence in the Net-Link layer.
>
> I was thinking of devices that implement ETF in hardware for full
> pacing hardware offload. Not in the driver.
>
>>>
>>>>>
>>>>> Clockbase is an issue. The device driver may have to convert to
>>>>> whatever format the device expects when copying skb->tstamp in the
>>>>> device tx descriptor.
>>>>
>>>> We do hope our definition is clear.
>>>> In the current kernel, skb->tstamp uses the system clock.
>>>> The hardware timestamp is PHC based, as it is used today for PTP two-step.
>>>> We only propose to use the same hardware timestamp.
>>>>
>>>> Passing the hardware timestamp in skb->tstamp might seem a bit tricky.
>>>> The goal is to leave the driver unaware of whether we
>>>> * synchronize the PHC and the system clock, or
>>>> * let ETF pass the hardware timestamp to skb->tstamp.
>>>> Only the applications and ETF are aware.
>>>> The application can detect it by checking the ETF flag.
>>>> The ETF flags are part of the network administration,
>>>> which also configures PTP and the system clock synchronization.
>>>>
>>>>>
>>>>>>> It only requires that pacing qdiscs, both sch_etf and sch_fq,
>>>>>>> optionally skip queuing in their .enqueue callback and instead allow
>>>>>>> the skb to pass to the device driver as is, with skb->tstamp set. Only
>>>>>>> to devices that advertise support for h/w pacing offload.
>>>>>>>
>>>>>> I did not use "Fair Queue traffic policing".
>>>>>> As for ETF, it is all about ordering packets from different applications.
>>>>>> How can we achieve that while skipping queuing?
>>>>>> Could you elaborate on this point?
>>>>>
>>>>> The qdisc can only defer pacing to hardware if hardware can ensure the
>>>>> same invariants on ordering, of course.
>>>>
>>>> Yes, this is why we suggest that ETF order packets using the hardware timestamp
>>>> and pass the packet on based on system time.
>>>> So ETF queries the system clock only, not the PHC.
>>>
>>> On which note: with this patch set all applications have to agree to
>>> use h/w time base in etf_enqueue_timesortedlist. In practice that
>>> makes this h/w mode a qdisc used by a single process?
>>
>> A single process theoretically does not need ETF; it can just set skb->tstamp and use a pass-through queue.
>> However, the only way now to set TC_SETUP_QDISC_ETF in the driver is by using ETF.
>
> Yes, and I'd like to eventually get rid of this constraint.
>
>
>> The ETF qdisc is per network interface,
>> so all applications that use a single network interface have to comply with the qdisc configuration.
>> That sounds like any other new feature in the Net-Link layer.
>>
>> Theoretically, a single network interface could participate in 2 TSN/PTP domains.
>> In that case you can create one qdisc without "use hardware time-stamp" for the first TSN/PTP domain and synchronize that PHC to the system clock,
>> and use a second qdisc with "use hardware time-stamp" for the other domain.
>> You will need a driver/hardware that supports multiple PHCs.
>> The separation of the domains could be done using VLANs.
>>
>> Note: a TSN domain is bound to a PTP domain.
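Coming back to the remark above that a single process could "just set
skb->tstamp and use a pass-through queue": for completeness, this is how an
application hands a per-packet transmit time to the kernel today, via
SO_TXTIME and an SCM_TXTIME cmsg (Linux 4.19+ with matching headers). The
address, port, payload and one-millisecond lead time below are made-up
values, and the clockid must match the one configured on the etf qdisc:

/* Sketch: queue one UDP packet for transmission ~1 ms from now using
 * SO_TXTIME + SCM_TXTIME. Destination, port and payload are made up;
 * real code must check every return value.
 */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <linux/net_tstamp.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct sock_txtime cfg = { .clockid = CLOCK_TAI, .flags = 0 };
	struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(7788) };
	char payload[] = "tsn-demo";
	char control[CMSG_SPACE(sizeof(uint64_t))] = { 0 };
	struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
	struct msghdr msg = {
		.msg_name = &dst, .msg_namelen = sizeof(dst),
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;
	struct timespec now;
	uint64_t txtime;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);    /* example address */
	setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));

	clock_gettime(CLOCK_TAI, &now);
	txtime = (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec + 1000000;

	/* The delivery time travels as ancillary data: one u64 in nanoseconds. */
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime));
	memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));

	sendmsg(fd, &msg, 0);
	close(fd);
	return 0;
}

On the qdisc side this pairs with something like
"tc qdisc replace dev eth0 parent 100:1 etf clockid CLOCK_TAI delta 300000 offload"
(device name and handles illustrative); the so_txtime selftest in the kernel
tree exercises the same path.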
>>
>>>
>>>>>
>>>>> Btw: this is quite a long list of CC:s
>>>>>
>>>> I need to update my company colleagues as well as the Linux group.
>>>
>>> Of course. But even ignoring that, this is still quite a large list (> 40).
>>>
>>> My response yesterday was actually blocked as a result ;) Retrying now.
>>>
>>
>> I left 5 people from Siemens; I hope that improves things.
>>
>>
>> Thanks for your comments and insights, they are very useful.
>> Erez