On 02/05, Joe Damato wrote: > Greetings: > > Welcome to v6. > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to > epoll with socket fds.") by allowing user applications to enable > epoll-based busy polling, set a busy poll packet budget, and enable or > disable prefer busy poll on a per epoll context basis. > > This makes epoll-based busy polling much more usable for user > applications than the current system-wide sysctl and hardcoded budget. > > To allow for this, two ioctls have been added for epoll contexts for > getting and setting a new struct, struct epoll_params. > > ioctl was chosen vs a new syscall after reviewing a suggestion by Willem > de Bruijn [1]. I am open to using a new syscall instead of an ioctl, but it > seemed that: > - Busy poll affects all existing epoll_wait and epoll_pwait variants in > the same way, so new verions of many syscalls might be needed. It > seems much simpler for users to use the correct > epoll_wait/epoll_pwait for their app and add a call to ioctl to enable > or disable busy poll as needed. This also probably means less work to > get an existing epoll app using busy poll. > > - previously added epoll_pwait2 helped to bring epoll closer to > existing syscalls (like pselect and ppoll) and this busy poll change > reflected as a new syscall would not have the same effect. > > Note: patch 1/4 as of v4 uses an or (||) instead of an xor. I thought about > it some more and I realized that if the user enables both the per-epoll > context setting and the system wide sysctl, then busy poll should be > enabled and not disabled. Using xor doesn't seem to make much sense after > thinking through this a bit. > > Longer explanation: > > Presently epoll has support for a very useful form of busy poll based on > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [2]). > > This form of busy poll allows epoll_wait to drive NAPI packet processing > which allows for a few interesting user application designs which can > reduce latency and also potentially improve L2/L3 cache hit rates by > deferring NAPI until userland has finished its work. > > The documentation available on this is, IMHO, a bit confusing so please > allow me to explain how one might use this: > > 1. Ensure each application thread has its own epoll instance mapping > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to > direct connections with specific dest ports to these queues. > > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy > polling will occur. This can help avoid the userland app from being > pre-empted by a hard IRQ while userland is running. Note this means that > userland must take care to call epoll_wait and not take too long in > userland since it now drives NAPI via epoll_wait. > > 3. Optionally: Consider using napi_defer_hard_irqs and gro_flush_timeout to > further restrict IRQ generation from the NIC. These settings are > system-wide so their impact must be carefully weighed against the running > applications. > > 4. Ensure that all incoming connections added to an epoll instance > have the same NAPI ID. This can be done with a BPF filter when > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single > accept thread is used which dispatches incoming connections to threads. > > 5. Lastly, busy poll must be enabled via a sysctl > (/proc/sys/net/core/busy_poll). > > Please see Eric Dumazet's paper about busy polling [3] and a recent > academic paper about measured performance improvements of busy polling [4] > (albeit with a modification that is not currently present in the kernel) > for additional context. > > The unfortunate part about step 5 above is that this enables busy poll > system-wide which affects all user applications on the system, > including epoll-based network applications which were not intended to > be used this way or applications where increased CPU usage for lower > latency network processing is unnecessary or not desirable. > > If the user wants to run one low latency epoll-based server application > with epoll-based busy poll, but would like to run the rest of the > applications on the system (which may also use epoll) without busy poll, > this system-wide sysctl presents a significant problem. > > This change preserves the system-wide sysctl, but adds a mechanism (via > ioctl) to enable or disable busy poll for epoll contexts as needed by > individual applications, making epoll-based busy poll more usable. > > Note that this change includes an or (as of v4) instead of an xor. If the > user has enabled both the system-wide sysctl and also the per epoll-context > busy poll settings, then epoll should probably busy poll (vs being > disabled). > > Thanks, > Joe > > v5 -> v6: > - patch 1/3 no functional change, but commit message corrected to explain > that an or (||) is being used instead of xor. > > - patch 3/4 is a new patch which adds support for per epoll context > prefer busy poll setting. > > - patch 4/4 updated to allow getting/setting per epoll context prefer > busy poll setting; this setting is limited to either 0 or 1. > > v4 -> v5: > - patch 3/3 updated to use memchr_inv to ensure that __pad is zero for > the EPIOCSPARAMS ioctl. Recommended by Greg K-H [5], Dave Chinner [6], > and Jiri Slaby [7]. > > v3 -> v4: > - patch 1/3 was updated to include an important functional change: > ep_busy_loop_on was updated to use or (||) instead of xor (^). After > thinking about it a bit more, I thought xor didn't make much sense. > Enabling both the per-epoll context and the system-wide sysctl should > probably enable busy poll, not disable it. So, or (||) makes more > sense, I think. > > - patch 3/3 was updated: > - to change the epoll_params fields to be __u64, __u16, and __u8 and > to pad the struct to a multiple of 64bits. Suggested by Greg K-H [8] > and Arnd Bergmann [9]. > - remove an unused pr_fmt, left over from the previous revision. > - ioctl now returns -EINVAL when epoll_params.busy_poll_usecs > > U32_MAX. > > v2 -> v3: > - cover letter updated to mention why ioctl seems (to me) like a better > choice vs a new syscall. > > - patch 3/4 was modified in 3 ways: > - when an unknown ioctl is received, -ENOIOCTLCMD is returned instead > of -EINVAL as the ioctl documentation requires. > - epoll_params.busy_poll_budget can only be set to a value larger than > NAPI_POLL_WEIGHT if code is run by privileged (CAP_NET_ADMIN) users. > Otherwise, -EPERM is returned. > - busy poll specific ioctl code moved out to its own function. On > kernels without busy poll support, -EOPNOTSUPP is returned. This also > makes the kernel build robot happier without littering the code with > more #ifdefs. > > - dropped patch 4/4 after Eric Dumazet's review of it when it was sent > independently to the list [10]. > > v1 -> v2: > - cover letter updated to make a mention of napi_defer_hard_irqs and > gro_flush_timeout as an added step 3 and to cite both Eric Dumazet's > busy polling paper and a paper from University of Waterloo for > additional context. Specifically calling out the xor in patch 1/4 > incase it is missed by reviewers. > > - Patch 2/4 has its commit message updated, but no functional changes. > Commit message now describes that allowing for a settable budget helps > to improve throughput and is more consistent with other busy poll > mechanisms that allow a settable budget via SO_BUSY_POLL_BUDGET. > > - Patch 3/4 was modified to check if the epoll_params.busy_poll_budget > exceeds NAPI_POLL_WEIGHT. The larger value is allowed, but an error is > printed. This was done for consistency with netif_napi_add_weight, > which does the same. > > - Patch 3/4 the struct epoll_params was updated to fix the type of the > data field; it was uint8_t and was changed to u8. > > - Patch 4/4 added to check if SO_BUSY_POLL_BUDGET exceeds > NAPI_POLL_WEIGHT. The larger value is allowed, but an error is > printed. This was done for consistency with netif_napi_add_weight, > which does the same. > > [1]: https://lore.kernel.org/lkml/65b1cb7f73a6a_250560294bd@xxxxxxxxxxxxxxxxxxxxxx.notmuch/ > [2]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/ > [3]: https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf > [4]: https://dl.acm.org/doi/pdf/10.1145/3626780 > [5]: https://lore.kernel.org/lkml/2024013001-prison-strum-899d@gregkh/ > [6]: https://lore.kernel.org/lkml/Zbm3AXgcwL9D6TNM@xxxxxxxxxxxxxxxxxxx/ > [7]: https://lore.kernel.org/lkml/efee9789-4f05-4202-9a95-21d88f6307b0@xxxxxxxxxx/ > [8]: https://lore.kernel.org/lkml/2024012551-anyone-demeaning-867b@gregkh/ > [9]: https://lore.kernel.org/lkml/57b62135-2159-493d-a6bb-47d5be55154a@xxxxxxxxxxxxxxxx/ > [10]: https://lore.kernel.org/lkml/CANn89i+uXsdSVFiQT9fDfGw+h_5QOcuHwPdWi9J=5U6oLXkQTA@xxxxxxxxxxxxxx/ > > Joe Damato (4): > eventpoll: support busy poll per epoll instance > eventpoll: Add per-epoll busy poll packet budget > eventpoll: Add per-epoll prefer busy poll option > eventpoll: Add epoll ioctl for epoll_params > > .../userspace-api/ioctl/ioctl-number.rst | 1 + > fs/eventpoll.c | 136 +++++++++++++++++- > include/uapi/linux/eventpoll.h | 13 ++ > 3 files changed, 144 insertions(+), 6 deletions(-) Coincidentally, we were looking into the same area and your patches are super useful :-) Thank you for plumbing in prefer_busy_poll. Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>