On Wed, Oct 28, 2020 at 2:35 PM Björn Töpel <bjorn.topel@xxxxxxxxx> wrote:
>
> Jakub suggested in [1] a "strict busy-polling mode without
> interrupts". This is a first stab at that.
>
> This series adds a new NAPI mode, called biased busy-polling, which is
> an extension to the existing busy-polling mode. The new mode is
> enabled on the socket layer, where a socket setting this option
> "promises" to busy-poll the NAPI context via a system call. When this
> mode is enabled, the NAPI context will operate with interrupts
> disabled. The kernel monitors that the busy-polling promise is
> fulfilled via an internal watchdog. If the socket fails to, or stops,
> performing the busy-polling, the mode will be disabled.
>
> Biased busy-polling follows the same mechanism as the existing
> busy-poll; the napi_id is reported to the socket via the skbuff. Later
> commits will extend napi_id reporting to XDP, in order to work
> correctly with XDP sockets.
>
> Let us walk through a flow of execution:
>
> 1. A socket sets the new SO_BIAS_BUSY_POLL socket option to true. The
>    socket now shows an intent of doing busy-polling. No data has been
>    received on the socket, so the napi_id of the socket is still 0
>    (non-valid). As usual for busy-polling, the SO_BUSY_POLL option
>    also has to be non-zero for biased busy-polling.
>
> 2. Data is received on the socket, changing the napi_id to non-zero.
>
> 3. The socket does a system call that has the busy-polling logic wired
>    up, e.g. recvfrom() for UDP sockets. The NAPI context is now marked
>    as biased busy-poll. The kernel watchdog is armed. If the NAPI
>    context is already running, it will try to finish as soon as
>    possible and move to busy-polling. If the NAPI context is not
>    running, it will execute the NAPI poll function for the
>    corresponding napi_id.
>
> 4. Go to step 3, or wait until the watchdog times out.
>
> The series is outlined as follows:
> Patch 1-2: Biased busy-polling, and option to set busy-poll budget.
> Patch 3-6: Busy-poll plumbing for XDP sockets.
> Patch 7-9: Add busy-polling support to the xdpsock sample.
>
> Performance UDP sockets:
>
> I hacked netperf to use non-blocking sockets, looping over
> recvfrom(). The following command line was used:
> $ netperf -H 192.168.1.1 -l 30 -t UDP_RR -v 2 -- \
>     -o min_latency,mean_latency,max_latency,stddev_latency,transaction_rate
>
> Non-blocking:
> 16,18.45,195,0.94,54070.369
> Non-blocking with biased busy-polling:
> 15,16.59,38,0.70,60086.313

But shouldn't a fair comparison be done against the current busy-polling
mode, which does not require netperf to use non-blocking mode in the
first place? Would disabling/rearming interrupts about 60,000 times per
second bring any benefit?

Additional questions:

- What happens to gro_flush_timeout and to the TCP segments accumulated
  in the GRO engine while biased busy-polling is in use?

- What mechanism would avoid a potential 200 ms latency when the
  application wants to exit cleanly? Presumably, when/if
  SO_BIAS_BUSY_POLL is used to clear sk->sk_bias_busy_poll, we need to
  make sure device interrupts are re-enabled.
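For concreteness, the intended userspace usage (steps 1-4 above, plus
the hacked netperf loop) would presumably look roughly like the sketch
below. SO_BIAS_BUSY_POLL and SO_BUSY_POLL_BUDGET are the new options
from this series; the fallback option values, the port, and the budget
number are placeholders, not what the patches actually define.

/* Minimal sketch of the flow in steps 1-4 above: a non-blocking UDP
 * socket that promises to busy-poll. Error handling is omitted.
 */
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif
#ifndef SO_BIAS_BUSY_POLL
#define SO_BIAS_BUSY_POLL 69            /* placeholder value */
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70          /* placeholder value */
#endif

int main(void)
{
        int one = 1, usecs = 50, budget = 64;
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(9999),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        char buf[2048];
        int fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        fcntl(fd, F_SETFL, O_NONBLOCK);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Step 1: show the busy-poll intent. SO_BUSY_POLL must be
         * non-zero as well for the biased mode to take effect.
         */
        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
        setsockopt(fd, SOL_SOCKET, SO_BIAS_BUSY_POLL, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget,
                   sizeof(budget));

        /* Steps 2-4: once traffic has set the socket's napi_id, each
         * recvfrom() call drives the NAPI context and re-arms the
         * watchdog; stopping the loop lets the watchdog fire.
         */
        for (;;) {
                ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);

                if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
                        break;
                /* ... consume n bytes ... */
        }

        close(fd);
        return 0;
}

If that reading is correct, the recvfrom() loop itself is the only
thing keeping the watchdog from firing, which is where the clean-exit
question above comes from.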
> Performance XDP sockets:
>
> Today, running the XDP socket sample on the same core as the softirq
> handling, performance tanks mainly because we do not yield to
> user-space when the XDP socket Rx queue is full.
>
> # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r
> Rx: 64Kpps
>
> # # biased busy-polling, budget 8
> # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 8
> Rx: 9.9Mpps
> # # biased busy-polling, budget 64
> # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 64
> Rx: 19.3Mpps
> # # biased busy-polling, budget 256
> # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 256
> Rx: 21.4Mpps
> # # biased busy-polling, budget 512
> # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 512
> Rx: 21.4Mpps
>
> Compared to the two-core case:
> # taskset -c 4 ./xdpsock -i ens785f1 -q 20 -n 1 -r
> Rx: 20.7Mpps
>
> We're getting better single-core performance than two-core, for this
> naïve drop scenario.
>
> The above tests were done with the 'ice' driver.
>
> Some outstanding questions:
>
> * Does biased busy-polling make sense for non-XDP sockets? For a
>   dedicated queue, biased busy-polling has a strong case. When the
>   NAPI context is shared with other sockets, it can affect the
>   latencies of sockets that were not explicitly busy-poll enabled.
>   Note that this is true for regular busy-polling as well, but the
>   biased version is stricter.
>
> * Currently, busy-polling for UDP/TCP is only wired up in the
>   recvmsg() path. Does it make sense to extend that to sendmsg() as
>   well?
>
> * Biased busy-polling only makes sense for non-blocking sockets.
>   Reject enabling of biased busy-polling unless the socket is
>   non-blocking?
>
> * The watchdog is 200 ms. Should it be configurable?
>
> * Extending xdp_rxq_info_reg() with napi_id touches a lot of drivers,
>   and I've only verified the Intel ones. Some drivers initialize NAPI
>   (generating the napi_id) after the xdp_rxq_info_reg() call, which
>   would maybe open up for another API? I did not send this RFC to all
>   the driver authors. I'll do that for a proper patch series.
>
> * Today, enabling busy-polling requires CAP_NET_ADMIN. For a NAPI
>   context that services multiple sockets, this makes sense, because
>   one socket can affect the performance of other sockets. Now, for a
>   *dedicated* queue for, say, an XDP socket, would it be OK to drop
>   CAP_NET_ADMIN, because it cannot affect other sockets/users?
>
> @Jakub Thanks for the early comments. I left the check in
> napi_schedule_prep(), because I hit that for the Intel i40e driver;
> forcing busy-polling on a core outside the interrupt affinity mask.
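On the xdp_rxq_info_reg() point: the per-driver change would presumably
boil down to something like the sketch below, assuming the series simply
adds a napi_id argument to the registration call (the exact prototype
may differ; foo_ring and foo_setup_rx_ring() are made-up driver names).

/* Hypothetical driver Rx setup, sketching the napi_id plumbing for the
 * xdp_rxq_info_reg() question above. The extra napi_id argument is an
 * assumption based on this series.
 */
#include <linux/netdevice.h>
#include <net/xdp.h>

struct foo_ring {
        struct xdp_rxq_info xdp_rxq;
        struct napi_struct *napi;       /* netif_napi_add() already done */
        struct net_device *netdev;
        u16 queue_index;
};

static int foo_setup_rx_ring(struct foo_ring *ring)
{
        /* napi->napi_id is only valid after netif_napi_add(), which is
         * the ordering issue mentioned for some drivers.
         */
        return xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
                                ring->queue_index, ring->napi->napi_id);
}

That also shows the ordering constraint mentioned above: NAPI must
already be registered, so that napi->napi_id is valid, before the Rx
queue info is registered; drivers that register the other way around
would indeed need a different hook.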
>
> [1] https://lore.kernel.org/netdev/20200925120652.10b8d7c5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Björn Töpel (9):
>   net: introduce biased busy-polling
>   net: add SO_BUSY_POLL_BUDGET socket option
>   xsk: add support for recvmsg()
>   xsk: check need wakeup flag in sendmsg()
>   xsk: add busy-poll support for {recv,send}msg()
>   xsk: propagate napi_id to XDP socket Rx path
>   samples/bpf: use recvfrom() in xdpsock
>   samples/bpf: add busy-poll support to xdpsock
>   samples/bpf: add option to set the busy-poll budget
>
>  arch/alpha/include/uapi/asm/socket.h          |   3 +
>  arch/mips/include/uapi/asm/socket.h           |   3 +
>  arch/parisc/include/uapi/asm/socket.h         |   3 +
>  arch/sparc/include/uapi/asm/socket.h          |   3 +
>  drivers/net/ethernet/amazon/ena/ena_netdev.c  |   2 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   2 +-
>  .../ethernet/cavium/thunder/nicvf_queues.c    |   2 +-
>  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |   2 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   2 +-
>  drivers/net/ethernet/intel/ice/ice_base.c     |   4 +-
>  drivers/net/ethernet/intel/ice/ice_txrx.c     |   2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   2 +-
>  drivers/net/ethernet/marvell/mvneta.c         |   2 +-
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |   4 +-
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   2 +-
>  .../ethernet/netronome/nfp/nfp_net_common.c   |   2 +-
>  drivers/net/ethernet/qlogic/qede/qede_main.c  |   2 +-
>  drivers/net/ethernet/sfc/rx_common.c          |   2 +-
>  drivers/net/ethernet/socionext/netsec.c       |   2 +-
>  drivers/net/ethernet/ti/cpsw_priv.c           |   2 +-
>  drivers/net/hyperv/netvsc.c                   |   2 +-
>  drivers/net/tun.c                             |   2 +-
>  drivers/net/veth.c                            |   2 +-
>  drivers/net/virtio_net.c                      |   2 +-
>  drivers/net/xen-netfront.c                    |   2 +-
>  fs/eventpoll.c                                |   3 +-
>  include/linux/netdevice.h                     |  33 +++---
>  include/net/busy_poll.h                       |  42 +++++--
>  include/net/sock.h                            |   4 +
>  include/net/xdp.h                             |   3 +-
>  include/uapi/asm-generic/socket.h             |   3 +
>  net/core/dev.c                                | 111 +++++++++++++++---
>  net/core/sock.c                               |  19 +++
>  net/core/xdp.c                                |   3 +-
>  net/xdp/xsk.c                                 |  36 +++++-
>  net/xdp/xsk_buff_pool.c                       |  13 +-
>  samples/bpf/xdpsock_user.c                    |  53 +++++++--
>  37 files changed, 296 insertions(+), 85 deletions(-)
>
>
> base-commit: 3cb12d27ff655e57e8efe3486dca2a22f4e30578
> --
> 2.27.0
>