On 08/12, Joe Damato wrote: > Greetings: > > Martin Karsten (CC'd) and I have been collaborating on some ideas about > ways of reducing tail latency when using epoll-based busy poll and we'd > love to get feedback from the list on the code in this series. This is > the idea I mentioned at netdev conf, for those who were there. Barring > any major issues, we hope to submit this officially shortly after RFC. > > The basic idea for suspending IRQs in this manner was described in an > earlier paper presented at Sigmetrics 2024 [1]. Let me explicitly call out the paper. Very nice analysis! > Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for > epoll_params") introduced the ability to enable or disable preferred > busy poll mode on a specific epoll context using an ioctl > (EPIOCSPARAMS). > > This series extends preferred busy poll mode by adding a sysfs parameter, > irq_suspend_timeout, which when used in combination with preferred busy > poll suspends device IRQs up to irq_suspend_timeout nanoseconds. > > Important call outs: > - Enabling per epoll-context preferred busy poll will now effectively > lead to a nonblocking iteration through napi_busy_loop, even when > busy_poll_usecs is 0. See patch 4. > > - Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa: > microchip: ksz9477: split half-duplex monitoring function"), but > may need to be respun if/when commit b4988e3bd1f0 ("eventpoll: Annotate > data-race of busy_poll_usecs") picked up by the vfs folks makes its way > into net-next. > > - In the future, time permitting, I hope to enable support for > napi_defer_hard_irqs, gro_flush_timeout (introduced in commit > 6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and > irq_suspend_timeout (introduced in this series) on a per-NAPI basis > (presumably via netdev-genl). > > ~ Description of the changes > > The overall idea is that IRQ suspension is introduced via a sysfs > parameter which controls the maximum time that IRQs can be suspended. > > Here's how it is intended to work: > - An administrator sets the existing sysfs parameters for > defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. > > - An administrator sets the new sysfs parameter irq_suspend_timeout > to a larger value than gro-timeout to enable IRQ suspension. Can you expand more on what's the problem with the existing gro_flush_timeout? Is it defer_hard_irqs_count? Or you want a separate timeout only for the perfer_busy_poll case(why?)? Because looking at the first two patches, you essentially replace all usages of gro_flush_timeout with a new variable and I don't see how it helps. Maybe expand more on what code paths are we trying to improve? Existing busy polling code is not super readable, so would be nice to simplify it a bit in the process (if possible) instead of adding one more tunable. > - The user application issues the existing epoll ioctl to set the > prefer_busy_poll flag on the epoll context. > > - The user application then calls epoll_wait to busy poll for network > events, as it normally would. > > - If epoll_wait returns events to userland, IRQ are suspended for the > duration of irq_suspend_timeout. > > - If epoll_wait finds no events and the thread is about to go to > sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is > resumed. > > As long as epoll_wait is retrieving events, IRQs (and softirq > processing) for the NAPI being polled remain disabled. Unless IRQ > suspension is continued by subsequent calls to epoll_wait, it > automatically times out after the irq_suspend_timeout timer expires. > > When network traffic reduces, eventually a busy poll loop in the kernel > will retrieve no data. When this occurs, regular deferral using > gro_flush_timeout for the polled NAPI is immediately re-enabled. Regular > deferral is also immediately re-enabled when the epoll context is > destroyed. > > ~ Benchmark configs & descriptions > > These changes were benchmarked with memcached [2] using the > benchmarking tool mutilate [3]. > > To facilitate benchmarking, a small patch [4] was applied to > memcached 1.6.29 (the latest memcached release as of this RFC) to allow > setting per-epoll context preferred busy poll and other settings > via environment variables. > > Multiple scenarios were benchmarked as described below > and the scripts used for producing these results can be found on > github [5]. > > (note: all scenarios use NAPI-based traffic splitting via SO_INCOMING_ID > by passing -N to memcached): > > - base: Other than NAPI-based traffic splitting, no other options are > enabled. > - busy: > - set defer_hard_irqs to 100 > - set gro_flush_timeout to 200,000 > - enable busy poll via the existing ioctl (busy_poll_usecs = 64, > busy_poll_budget = 64, prefer_busy_poll = true) > - deferX: > - set defer_hard_irqs to 100 > - set gro_flush_timeout to X,000 [..] > - suspendX: > - set defer_hard_irqs to 100 > - set gro_flush_timeout to X,000 > - set irq_suspend_timeout to 20,000,000 > - enable busy poll via the existing ioctl (busy_poll_usecs = 0, > busy_poll_budget = 64, prefer_busy_poll = true) What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback to busy_poll sysctl value?