Greetings:

Martin Karsten (CC'd) and I have been collaborating on some ideas about
ways of reducing tail latency when using epoll-based busy poll, and we'd
love to get feedback from the list on the code in this series. This is
the idea I mentioned at netdev conf, for those who were there. Barring
any major issues, we hope to submit this officially shortly after the
RFC.

The basic idea for suspending IRQs in this manner was described in an
earlier paper presented at Sigmetrics 2024 [1].

Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for
epoll_params") introduced the ability to enable or disable preferred
busy poll mode on a specific epoll context using an ioctl
(EPIOCSPARAMS).

This series extends preferred busy poll mode by adding a sysfs
parameter, irq_suspend_timeout, which, when used in combination with
preferred busy poll, suspends device IRQs for up to irq_suspend_timeout
nanoseconds.

Important callouts:

  - Enabling per-epoll-context preferred busy poll will now effectively
    lead to a nonblocking iteration through napi_busy_loop, even when
    busy_poll_usecs is 0. See patch 4.

  - Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa:
    microchip: ksz9477: split half-duplex monitoring function"), but may
    need to be respun if/when commit b4988e3bd1f0 ("eventpoll: Annotate
    data-race of busy_poll_usecs"), picked up by the vfs folks, makes
    its way into net-next.

  - In the future, time permitting, I hope to enable support for
    napi_defer_hard_irqs, gro_flush_timeout (introduced in commit
    6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and
    irq_suspend_timeout (introduced in this series) on a per-NAPI basis
    (presumably via netdev-genl).

~ Description of the changes

The overall idea is that IRQ suspension is introduced via a sysfs
parameter which controls the maximum time that IRQs can be suspended.

Here's how it is intended to work:

  - An administrator sets the existing sysfs parameters for
    defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.

  - An administrator sets the new sysfs parameter irq_suspend_timeout
    to a larger value than gro_flush_timeout to enable IRQ suspension.

  - The user application issues the existing epoll ioctl to set the
    prefer_busy_poll flag on the epoll context.

  - The user application then calls epoll_wait to busy poll for network
    events, as it normally would.

  - If epoll_wait returns events to userland, IRQs are suspended for
    the duration of irq_suspend_timeout.

  - If epoll_wait finds no events and the thread is about to go to
    sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is
    resumed.

As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. Unless IRQ
suspension is continued by subsequent calls to epoll_wait, it
automatically times out once the irq_suspend_timeout timer expires.

When network traffic subsides, eventually a busy poll loop in the
kernel will retrieve no data. When this occurs, regular deferral using
gro_flush_timeout for the polled NAPI is immediately re-enabled.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
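To make the intended flow above concrete, here is a minimal userspace
sketch. This is illustrative only, not code from the series: the
interface name eth0 is an assumption, the knob values mirror the
suspend20 benchmark configuration described below, and struct
epoll_params / EPIOCSPARAMS are re-declared locally (matching the uapi
definitions from commit 18e2bf0edf4d) in case installed headers predate
them; drop the local copies if <linux/eventpoll.h> already provides
them. The irq_suspend_timeout sysfs file exists only with this series
applied.

  /* Administrator setup (suspend20 values; eth0 is an assumed name):
   *
   *   echo 100      > /sys/class/net/eth0/napi_defer_hard_irqs
   *   echo 20000    > /sys/class/net/eth0/gro_flush_timeout
   *   echo 20000000 > /sys/class/net/eth0/irq_suspend_timeout
   */
  #include <stdio.h>
  #include <string.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>
  #include <linux/types.h>

  /* As added to include/uapi/linux/eventpoll.h by commit 18e2bf0edf4d;
   * re-declared here in case the installed headers are older.
   */
  struct epoll_params {
          __u32 busy_poll_usecs;
          __u16 busy_poll_budget;
          __u8 prefer_busy_poll;
          __u8 __pad; /* pad the struct to a multiple of 64 bits */
  };

  #define EPOLL_IOC_TYPE 0x8A
  #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)

  int main(void)
  {
          struct epoll_event events[64];
          struct epoll_params params;
          int epfd, n;

          epfd = epoll_create1(0);
          if (epfd < 0) {
                  perror("epoll_create1");
                  return 1;
          }

          memset(&params, 0, sizeof(params));
          params.busy_poll_usecs = 0;  /* 0 still busy polls; patch 4 */
          params.busy_poll_budget = 64;
          params.prefer_busy_poll = 1;

          if (ioctl(epfd, EPIOCSPARAMS, &params) < 0) {
                  perror("ioctl(EPIOCSPARAMS)");
                  return 1;
          }

          /* epoll_ctl() registration elided; the sockets in the set
           * should belong to the same NAPI, e.g. via
           * SO_INCOMING_NAPI_ID-based traffic splitting, as memcached
           * does with -N (see below).
           */

          for (;;) {
                  /* While these calls keep returning events, device
                   * IRQs stay suspended (bounded by
                   * irq_suspend_timeout). An empty poll before sleeping
                   * resumes regular deferral via gro_flush_timeout and
                   * napi_defer_hard_irqs.
                   */
                  n = epoll_wait(epfd, events, 64, -1);
                  if (n < 0) {
                          perror("epoll_wait");
                          return 1;
                  }
                  /* ... process the n ready events ... */
          }
  }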
~ Benchmark configs & descriptions

These changes were benchmarked with memcached [2] using the
benchmarking tool mutilate [3]. To facilitate benchmarking, a small
patch [4] was applied to memcached 1.6.29 (the latest memcached release
as of this RFC) to allow setting per-epoll-context preferred busy poll
and other parameters via environment variables.

Multiple scenarios were benchmarked as described below; the scripts
used to produce these results can be found on GitHub [5].

(Note: all scenarios use NAPI-based traffic splitting via
SO_INCOMING_NAPI_ID by passing -N to memcached.)

  - base:
    - Other than NAPI-based traffic splitting, no other options are
      enabled.

  - busy:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to 200,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
      busy_poll_budget = 64, prefer_busy_poll = true)

  - deferX:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to X,000

  - suspendX:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to X,000
    - set irq_suspend_timeout to 20,000,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
      busy_poll_budget = 64, prefer_busy_poll = true)

~ Benchmark results

Tested on:
  Single socket AMD EPYC 7662 64-Core Processor
  Hyperthreading disabled
  4 NUMA Zones (NPS=4)
  16 CPUs per NUMA zone (64 cores total)
  2 x Dual port 100Gbps Mellanox Technologies ConnectX-5 Ex EN NIC

The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface under test. memcached binds to the ipv4
address on the configured interface. The NIC's interrupt coalescing
configuration is left at boot-time defaults.

The overall takeaway from the results below is that the new mechanism
(suspend20, see below) reduces 99th percentile latency and increases
QPS in the MAX QPS case compared to the other cases, and reduces
latency in the lower QPS cases at CPU usage comparable to the base case
(and lower CPU usage than the busy case).

In the tables below, load is the offered load and qps the achieved
throughput in queries per second; avglat, 95%lat, and 99%lat are the
average, 95th, and 99th percentile latencies in microseconds; cpu is
CPU utilization in percent.

base:
   load      qps  avglat  95%lat  99%lat  cpu
   200K   199982     109     225     385   30
   400K   400054     138     262     676   44
   600K   599968     165     396     737   64
   800K   800002     353    1136    2098   83
  1000K   964960    3202    5556    7003   98
    MAX   957274    4255    5526    6843  100

busy:
   load      qps  avglat  95%lat  99%lat  cpu
   200K   199936     101     239     287   57
   400K   399795      81     230     302   83
   600K   599797      65     169     264   95
   800K   799789      67     145     221   99
  1000K  1000135      97     186     287  100
    MAX  1079228    3752    7481   12634   98

defer20:
   load      qps  avglat  95%lat  99%lat  cpu
   200K   200052      60     130     156   28
   400K   399797      67     140     176   49
   600K   600049      94     189     483   68
   800K   800106     246     959    2201   88
  1000K   857377    4377    5674    5830  100
    MAX   974672    4162    5454    5815  100

defer200:
   load      qps  avglat  95%lat  99%lat  cpu
   200K   200029     165     258     316   18
   400K   399978     183     280     340   32
   600K   599818     205     310     367   46
   800K   799869     265     439     829   73
  1000K   995961    2307    5163    7027   98
    MAX  1050680    3837    5020    5596  100

suspend20:
   load      qps  avglat  95%lat  99%lat  cpu
   200K   199968      58     128     161   31
   400K   400191      61     135     175   51
   600K   599872      67     142     196   66
   800K   800050      78     153     220   82
  1000K   999638     101     194     292   91
    MAX  1144308    3596    3961    4155  100

suspend200:
   load      qps  avglat  95%lat  99%lat  cpu
   200K   199973     149     251     313   20
   400K   399957     154     270     331   35
   600K   599878     157     284     351   51
   800K   800091     158     293     359   65
  1000K  1000399     173     311     393   85
    MAX  1128033    3636    4210    4381  100

Thanks,
Martin and Joe

[1]: https://doi.org/10.1145/3626780
[2]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[3]: https://github.com/leverich/mutilate
[4]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
[5]: https://github.com/martinkarsten/irqsuspend

Martin Karsten (5):
  net: Add sysfs parameter irq_suspend_timeout
  net: Suspend softirq when prefer_busy_poll is set
  net: Add control functions for irq suspension
  eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
  eventpoll: Control irq suspension for prefer_busy_poll
 Documentation/networking/napi.rst |  3 ++
 fs/eventpoll.c                    | 26 +++++++++++++--
 include/linux/netdevice.h         |  2 ++
 include/net/busy_poll.h           |  3 ++
 net/core/dev.c                    | 55 +++++++++++++++++++++++++++----
 net/core/net-sysfs.c              | 18 ++++++++++
 6 files changed, 98 insertions(+), 9 deletions(-)

--
2.25.1