On Tue, May 7, 2019 at 8:24 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote: > > On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote: > > > > This RFC proposes to add busy-poll support to AF_XDP sockets. With > > > > busy-poll, the driver is executed in process context by calling the > > > > poll() syscall. The main advantage with this is that all processing > > > > occurs on a single core. This eliminates the core-to-core cache > > > > transfers that occur between the application and the softirqd > > > > processing on another core, that occurs without busy-poll. From a > > > > systems point of view, it also provides an advatage that we do not > > > > have to provision extra cores in the system to handle > > > > ksoftirqd/softirq processing, as all processing is done on the single > > > > core that executes the application. The drawback of busy-poll is that > > > > max throughput seen from a single application will be lower (due to > > > > the syscall), but on a per core basis it will often be higher as > > > > the normal mode runs on two cores and busy-poll on a single one. > > > > > > > > The semantics of busy-poll from the application point of view are the > > > > following: > > > > > > > > * The application is required to call poll() to drive rx and tx > > > > processing. There is no guarantee that softirq and interrupts will > > > > do this for you. This is in contrast with the current > > > > implementations of busy-poll that are opportunistic in the sense > > > > that packets might be received/transmitted by busy-poll or > > > > softirqd. (In this patch set, softirq/ksoftirqd will kick in at high > > > > loads just as the current opportunistic implementations, but I would > > > > like to get to a point where this is not the case for busy-poll > > > > enabled XDP sockets, as this slows down performance considerably and > > > > starts to use one more core for the softirq processing. The end goal > > > > is for only poll() to drive the napi loop when busy-poll is enabled > > > > on an AF_XDP socket. More about this later.) > > > > > > > > * It should be enabled on a per socket basis. No global enablement, i.e. > > > > the XDP socket busy-poll will not care about the current > > > > /proc/sys/net/core/busy_poll and busy_read global enablement > > > > mechanisms. > > > > > > > > * The batch size (how many packets that are processed every time the > > > > napi function in the driver is called, i.e. the weight parameter) > > > > should be configurable. Currently, the busy-poll size of AF_INET > > > > sockets is set to 8, but for AF_XDP sockets this is too small as the > > > > amount of processing per packet is much smaller with AF_XDP. This > > > > should be configurable on a per socket basis. > > > > > > > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll() > > > > call the napi contexts of all of them should be executed. This is in > > > > contrast to the AF_INET busy-poll that quits after the fist one that > > > > finds any packets. We need all napi contexts to be executed due to > > > > the first requirement in this list. The behaviour we want is much more > > > > like regular sockets in that all of them are checked in the poll > > > > call. > > > > > > > > * Should be possible to mix AF_XDP busy-poll sockets with any other > > > > sockets including busy-poll AF_INET ones in a single poll() call > > > > without any change to semantics or the behaviour of any of those > > > > socket types. > > > > > > > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll > > > > mode return POLLERR if the fill ring is empty or the completion > > > > queue is full. > > > > > > > > Busy-poll support is enabled by calling a new setsockopt called > > > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value > > > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and > > > > any other value will return an error. > > > > > > > > A typical packet processing rxdrop loop with busy-poll will look something > > > > like this: > > > > > > > > for (i = 0; i < num_socks; i++) { > > > > fds[i].fd = xsk_socket__fd(xsks[i]->xsk); > > > > fds[i].events = POLLIN; > > > > } > > > > > > > > for (;;) { > > > > ret = poll(fds, num_socks, 0); > > > > if (ret <= 0) > > > > continue; > > > > > > > > for (i = 0; i < num_socks; i++) > > > > rx_drop(xsks[i], fds); /* The actual application */ > > > > } > > > > > > > > Need some advice around this issue please: > > > > > > > > In this patch set, softirq/ksoftirqd will kick in at high loads and > > > > render the busy poll support useless as the execution is now happening > > > > in the same way as without busy-poll support. Everything works from an > > > > application perspective but this defeats the purpose of the support > > > > and also consumes an extra core. What I would like to accomplish when > > > > > > Not sure what you mean by 'extra core' . > > > The above poll+rx_drop is executed for every af_xdp socket > > > and there are N cpus processing exactly N af_xdp sockets. > > > Where is 'extra core'? > > > Are you suggesting a model where single core will be busy-polling > > > all af_xdp sockets? and then waking processing threads? > > > or single core will process all sockets? > > > I think the af_xdp model should be flexible and allow easy out-of-the-box > > > experience, but it should be optimized for 'ideal' user that > > > does the-right-thing from max packet-per-second point of view. > > > I thought we've already converged on the model where af_xdp hw rx queues > > > bind one-to-one to af_xdp sockets and user space pins processing > > > threads one-to-one to af_xdp sockets on corresponding cpus... > > > If so that's the model to optimize for on the kernel side > > > while keeping all other user configurations functional. > > > > > > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never > > > > invoked for the traffic that goes to this socket. This way, we would > > > > get better performance on a per core basis and also get the same > > > > behaviour independent of load. > > > > > > I suspect separate rx kthreads of af_xdp socket processing is necessary > > > with and without busy-poll exactly because of 'high load' case > > > you've described. > > > If we do this additional rx-kthread model why differentiate > > > between busy-polling and polling? > > > > > > af_xdp rx queue is completely different form stack rx queue because > > > of target dma address setup. > > > Using stack's napi ksoftirqd threads for processing af_xdp queues creates > > > the fairness issues. Isn't it better to have separate kthreads for them > > > and let scheduler deal with fairness among af_xdp processing and stack? > > > > When using ordinary poll() on an AF_XDP socket, the application will > > run on one core and the driver processing will run on another in > > softirq/ksoftirqd context. (Either due to explicit core and irq > > pinning or due to the scheduler or irqbalance moving the two threads > > apart.) In AF_XDP busy-poll mode of this RFC, I would like the > > application and the driver processing to occur on a single core, thus > > there is no "extra" driver core involved that need to be taken into > > account when sizing and/or provisioning the system. The napi context > > is in this mode invoked from syscall context when executing the poll > > syscall from the application. > > > > Executing the app and the driver on the same core could of course be > > accomplished already today by pinning the application and the driver > > interrupt to the same core, but that would not be that efficient due > > to context switching between the two. > > Have you benchmarked it? > I don't think context switch will be that noticable when kpti is off. > napi processes 64 packets descriptors and switches back to user to > do payload processing of these packets. > I would think that the same job is on two different cores would be > a bit more performant with user code consuming close to 100% > and softirq is single digit %. Say it's 10%. > I believe combining the two on single core is not 100 + 10 since > there is no cache bouncing. So Mpps from two cores setup will > reduce by 2-3% instead of 10%. > There is a cost of going to sleep and being woken up from poll(), > but 64 packets is probably large enough number to amortize. > If not, have you tried to bump napi budget to say 256 for af_xdp rx queues? > Busy-poll avoids sleep/wakeup overhead and probably can make > this scheme work with lower batching (like 64), but fundamentally > they're the same thing. > I'm not saying that we shouldn't do busy-poll. I'm saying it's > complimentary, but in all cases single core per af_xdp rq queue > with user thread pinning is preferred. > > > A more efficient way would be to > > call the napi loop from within the poll() syscall when you are inside > > the kernel anyway. This is what the classical busy-poll mechanism > > operating on AF_INET sockets does. Inside the poll() call, it executes > > the napi context of the driver until it finds a packet (if it is rx) > > and then returns to the application that then processes the packets. I > > would like to adopt something quite similar for AF_XDP sockets. (Some > > of the differences can be found at the top of the original post.) > > > > From an API point of view with busy-poll of AF_XDP sockets, the user > > would bind to a queue number and taskset its application to a specific > > core and both the app and the driver execution would only occur on > > that core. This is in my mind simpler than with regular poll or AF_XDP > > using no syscalls on rx (i.e. current state), in which you bind to a > > queue, taskset your application to a core and then you also have to > > take care to route the interrupt of the queue you bound to to another > > core that will execute the driver part in the kernel. So the model is > > in both cases still one core - one socket - one napi. (Users can of > > course create multiple sockets in an app if they desire.) > > > > The main reasons I would like to introduce busy-poll for AF_XDP > > sockets are: > > > > * It is simpler to provision, see arguments above. Both application > > and driver runs efficiently on the same core. > > > > * It is faster (on a per core basis) since we do not have any core to > > core communication. All header and descriptor transfers between > > kernel and application are core local which is much > > faster. Scalability will also be better. E.g., 64 bytes desc + 64 > > bytes packet header = 128 bytes per packet less on the interconnect > > between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores > > this will be ~400Gbit/s of interconnect traffic less with busy-poll. > > exactly. don't make cpu do this core-to-core stuff. > pin one rx to one core. > > > * It provides a way to seamlessly replace user-space drivers in DPDK > > with Linux drivers in kernel space. (Do not think I have to argue > > why this is a good idea on this list ;-).) The DPDK model is that > > application and driver run on the same core since they are both in > > user space. If we can provide the same model (both running > > efficiently on the same core, NOT drivers in user-space) with > > AF_XDP, it is easy for DPDK users to make the switch. Compare this > > to the current way where there are both application cores and > > driver/ksoftirqd cores. If a systems builder had 12 cores in his > > appliance box and they had 12 instances of a DPDK app, one on each > > core, how would he/she reason when repartitioning between > > application and driver cores? 8 application cores and 4 driver > > cores, or 6 of each? Maybe it is also packet dependent? Etc. Much > > simpler to migrate if we had an efficient way to run both of them on > > the same core. > > > > Why no interrupt? That should have been: no interrupts enabled to > > start with. We would like to avoid interrupts as much as possible > > since when they trigger, we will revert to the non busy-poll model, > > i.e. processing on two separate cores, and the advantages from above > > will disappear. How to accomplish this? > > > > * One way would be to create a napi context with the queue we have > > bound to but with no interrupt associated with it, or it being > > disabled. The socket would in that case only be able to receive and > > send packets when calling the poll() syscall. If you do not call > > poll(), you do not get any packets, nor are any packets sent. It > > would only be possible to support this with a poll() timeout value > > of zero. This would have the best performance > > > > * Maybe we could support timeout values >0 by re-enabling the interrupt > > at some point. When calling poll(), the core would invoke the napi > > context repeatedly with the interrupt of that napi disabled until it > > found a packet, but max for a period of time up until the busy poll > > timeout (like regular busy poll today does). If that times out, we > > go up to the regular timeout of the poll() call and enable > > interrupts of the queue associated with the napi and put the process > > to sleep. Once woken up by an interrupt, the interrupt of the napi > > would be disabled again and control returned to the application. We > > would with this scheme process the vast majority of packets locally > > on a core with interrupts disabled and with good performance and > > only when we have low load and are sleeping/waiting in poll would we > > process some packets using interrupts on the core that the > > interrupt has been bound to. > > I think both 'no interrupt' solutions are challenging for users. > Stack rx queues and af_xdp rx queues should look almost the same from > napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread > to work with both poll and busy-poll. The only difference between > poll and busy-poll will be the running context: new kthread vs user task. > If busy-poll drained the queue then new kthread napi has no work to do. > No irq approach could be marginally faster, but more error prone. > With new kthread the user space will still work in all configuration. > Even when single user task is processing many af_xdp sockets. > > I'm proposing new kthread only partially for performance reasons, but > mainly to avoid sharing stack rx and af_xdp queues within the same softirq. > Currently we share softirqd for stack napis for all NICs in the system, > but af_xdp depends on isolated processing. > Ideally we have rss into N queues for stack and rss into M af_xdp sockets. > The same host will be receive traffic on both. > Even if we rss stack queues to one set of cpus and af_xdp on another cpus > softirqds are doing work on all cpus. > A burst of 64 packets on stack queues or some other work in softirqd > will spike the latency for af_xdp queues if softirq is shared. > Hence the proposal for new napi_kthreads: > - user creates af_xdp socket and binds to _CPU_ X then > - driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed) > - spawns kthread pinned to cpu X > - configures irq for that af_xdp queue to fire on cpu X > - user space with the help of libbpf pins its processing thread to that cpu X > - repeat above for as many af_xdp sockets as there as cpus > (its also ok to pick the same cpu X for different af_xdp socket > then new kthread is shared) > - user space configures hw to RSS to these set of af_xdp sockets. > since ethtool api is a mess I propose to use af_xdp api to do this rss config > > imo that would be the simplest and performant way of using af_xdp. > All configuration apis are under libbpf (or libxdp if we choose to fork it) > End result is one af_xdp rx queue - one napi - one kthread - one user thread. > All pinned to the same cpu with irq on that cpu. > Both poll and busy-poll approaches will not bounce data between cpus. > No 'shadow' queues to speak of and should solve the issues that > folks were bringing up in different threads. > How crazy does it sound? Actually, it sounds remarkably sane :-). It will create something quite similar to what I have been wanting, but you take it at least two steps further. Did not think about introducing a separate kthread as a potential solution, and the user space configuration of RSS (and maybe other flow steering mechanisms) from AF_XDP Björn and I have only been loosely talking about. Anyway, I am producing performance numbers for the options that we have talked about. I will get back to you with them as soon as I have them and we can continue the discussions based on those. Thanks: Magnus