Tossing in my $.02:

I anticipate that most users of AF_XDP will want packet processing for a
given RX queue to happen on a single core - otherwise we end up paying
for cross-core cache traffic. The usual model is one thread, one socket,
one core, but this isn't enforced anywhere in the AF_XDP code; setting it
up is left entirely to the user.

On 7 May 2019, at 11:24, Alexei Starovoitov wrote:

> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases single core per af_xdp rx queue
> with user thread pinning is preferred.

So I think we're on the same page here.

> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new
> kthread to work with both poll and busy-poll. The only difference
> between poll and busy-poll will be the running context: new kthread
> vs user task.

...

> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.

True, but would it be shared? This goes back to the current model,
which, as used by Intel, is:

    (channel == RX, TX, softirq)

MLX, on the other hand, wants:

    (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)

which would indeed lead to sharing. The more I look at the above, the
less I like it. Perhaps this should be disallowed? I believe there was
some mention at LSF/MM that the 'channel' concept is specific to the HW
and really shouldn't be part of the SW API.

> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>   exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to
>   that cpu X
> - repeat above for as many af_xdp sockets as there are cpus
>   (it's also ok to pick the same cpu X for different af_xdp sockets,
>   then the new kthread is shared)
> - user space configures hw to RSS to this set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this
>   rss config

From a high-level point of view this sounds quite sensible, but some
details need ironing out. The proposal above essentially enforces a
model of:

    (af_xdp    = RX.af_xdp + bound_cpu)
    (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)

(ignoring TX for now)

I foresee two issues with this approach:

  1. hardware limitations in the number of queues/rings
  2. RSS/steering rules

> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>   exposed)

Here, the driver may not be able to create an arbitrary RQ; it may need
to tear down or reuse an existing one used by the stack. This may not be
an issue for modern hardware.

> - user space configures hw to RSS to this set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this
>   rss config

Currently, RSS only steers default traffic. On a system with shared
stack/af_xdp queues, there needs to be a way to split the traffic types,
unless we're talking about a model where all traffic goes to AF_XDP.
That classification has to be done by the NIC, since it happens before
RSS steering - which today means sending flow match rules to the NIC,
and that is less than ideal. I agree that the ethtool interface is
suboptimal, but it does make it clear to the user what's going on.
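To make the "one thread, one socket, one core" model above concrete,
here's a rough, untested sketch of what the userspace side looks like
today with the xsk_* helpers from libbpf (tools/lib/bpf/xsk.h); the
fill/completion ring handling and error cleanup are elided:

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>

#define NUM_FRAMES	4096
#define FRAME_SIZE	XSK_UMEM__DEFAULT_FRAME_SIZE

static struct xsk_umem *umem;
static struct xsk_ring_prod fq, tx;
static struct xsk_ring_cons cq, rx;
static struct xsk_socket *xsk;

/* One AF_XDP socket bound to one hw queue, with the processing thread
 * pinned to the same cpu the queue's irq is steered to. */
static int bind_one_queue(const char *ifname, int queue_id, int cpu)
{
	cpu_set_t set;
	void *bufs;

	/* umem backing the rx/tx descriptors for this one queue */
	if (posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE))
		return -1;
	if (xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
			     &fq, &cq, NULL))
		return -1;

	/* one socket <-> one queue id (today the queue id is still
	 * visible to userspace) */
	if (xsk_socket__create(&xsk, ifname, queue_id, umem,
			       &rx, &tx, NULL))
		return -1;

	/* pin this thread to cpu X; the irq affinity for the queue has
	 * to be set separately (e.g. via /proc/irq/N/smp_affinity) */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Note that the irq affinity and the cpu/queue mapping still have to be
wired up by hand - today all of that bookkeeping falls on the user.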
Perhaps an af_xdp library that does some bookkeeping:

 - open af_xdp socket
 - define af_xdp_set as (classification, steering rules, other?)
 - bind socket to (cpu, af_xdp_set)
 - kernel:
   - pins calling thread to cpu
   - creates kthread if one doesn't exist, binds it to the irq and cpu
   - has driver create RQ.af_xdp, possibly replacing RQ.stack
   - applies (af_xdp_set) to the NIC

Seems workable, but a little complicated? The complexity could be moved
into a separate library. (A rough sketch of what such an API could look
like is at the end of this mail.)

> imo that would be the simplest and most performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to
> fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user
> thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.

Sounds like a sensible model from my POV.
--
Jonathan
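PS: purely to make the shape of that bookkeeping API concrete, it might
look something like the sketch below - every name here is invented for
illustration, nothing like this exists in libbpf today:

#include <linux/types.h>

struct af_xdp_sock;			/* opaque socket handle */

/* classification + steering description applied to a set of sockets */
struct af_xdp_set {
	__u32	flow_type;		/* classification, e.g. TCP_V4_FLOW */
	__be16	dst_port;		/* example steering rule */
	/* other? */
};

/* open an af_xdp socket on ifname */
int af_xdp_open(struct af_xdp_sock **sk, const char *ifname);

/* bind socket to (cpu, af_xdp_set): the kernel pins the calling thread
 * to cpu, creates/reuses the kthread bound to that cpu and irq, has the
 * driver create RQ.af_xdp (possibly replacing RQ.stack), and applies
 * the set's classification/steering rules to the NIC. */
int af_xdp_bind(struct af_xdp_sock *sk, int cpu,
		const struct af_xdp_set *set);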