On Mon, Oct 21, 2024 at 01:53:01AM +0000, Joe Damato wrote:
> diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
> index dfa5d549be9c..3b43477a52ce 100644
> --- a/Documentation/networking/napi.rst
> +++ b/Documentation/networking/napi.rst
> @@ -192,6 +192,28 @@ is reused to control the delay of the timer, while
>  ``napi_defer_hard_irqs`` controls the number of consecutive empty polls
>  before NAPI gives up and goes back to using hardware IRQs.
>
> +The above parameters can also be set on a per-NAPI basis using netlink via
> +netdev-genl. This can be done programmatically in a user application or by
> +using a script included in the kernel source tree: ``tools/net/ynl/cli.py``.
> +
> +For example, using the script:
> +
> +.. code-block:: bash
> +
> +  $ kernel-source/tools/net/ynl/cli.py \
> +            --spec Documentation/netlink/specs/netdev.yaml \
> +            --do napi-set \
> +            --json='{"id": 345,
> +                     "defer-hard-irqs": 111,
> +                     "gro-flush-timeout": 11111}'
> +
> +Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
> +via netdev-genl. There is no global sysfs parameter for this value.

In JSON, both the gro-flush-timeout and irq-suspend-timeout parameter names
are written with hyphens, but the rest of the docs use underscores (that is,
gro_flush_timeout and irq_suspend_timeout), right?

> +
> +``irq_suspend_timeout`` is used to determine how long an application can
> +completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
> +which can be set on a per-epoll context basis with ``EPIOCSPARAMS`` ioctl.
> +
>  .. _poll:
>
>  Busy polling
> @@ -207,6 +229,46 @@ selected sockets or using the global ``net.core.busy_poll`` and
>  ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
>  also exists.
>
> +epoll-based busy polling
> +------------------------
> +
> +It is possible to trigger packet processing directly from calls to
> +``epoll_wait``.
> +In order to use this feature, a user application must ensure
> +all file descriptors which are added to an epoll context have the same NAPI ID.
> +
> +If the application uses a dedicated acceptor thread, the application can obtain
> +the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
> +distribute that file descriptor to a worker thread. The worker thread would add
> +the file descriptor to its epoll context. This would ensure each worker thread
> +has an epoll context with FDs that have the same NAPI ID.
> +
> +Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program be
> +inserted to distribute incoming connections to threads such that each thread is
> +only given incoming connections with the same NAPI ID. Care must be taken to
> +carefully handle cases where a system may have multiple NICs.
> +
> +In order to enable busy polling, there are two choices:
> +
> +1. ``/proc/sys/net/core/busy_poll`` can be set with a time in useconds to busy
> +   loop waiting for events. This is a system-wide setting and will cause all
> +   epoll-based applications to busy poll when they call epoll_wait. This may
> +   not be desirable as many applications may not have the need to busy poll.
> +
> +2. Applications using recent kernels can issue an ioctl on the epoll context
> +   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
> +   epoll_params``:, which user programs can define as follows:
> +
> +.. code-block:: c
> +
> +  struct epoll_params {
> +      uint32_t busy_poll_usecs;
> +      uint16_t busy_poll_budget;
> +      uint8_t prefer_busy_poll;
> +
> +      /* pad the struct to a multiple of 64bits */
> +      uint8_t __pad;
> +  };
> +
>  IRQ mitigation
>  ---------------
>
> @@ -222,12 +284,78 @@ Such applications can pledge to the kernel that they will perform a busy
>  polling operation periodically, and the driver should keep the device IRQs
>  permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
>  socket option.
>  To avoid system misbehavior the pledge is revoked
> -if ``gro_flush_timeout`` passes without any busy poll call.
> +if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
> +busy polling applications, the ``prefer_busy_poll`` field of ``struct
> +epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
> +enable this mode. See the above section for more details.
>
>  The NAPI budget for busy polling is lower than the default (which makes
>  sense given the low latency intention of normal busy polling). This is
>  not the case with IRQ mitigation, however, so the budget can be adjusted
> -with the ``SO_BUSY_POLL_BUDGET`` socket option.
> +with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
> +applications, the ``busy_poll_budget`` field can be adjusted to the desired value
> +in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
> +ioctl. See the above section for more details.
> +
> +It is important to note that choosing a large value for ``gro_flush_timeout``
> +will defer IRQs to allow for better batch processing, but will induce latency
> +when the system is not fully loaded. Choosing a small value for
> +``gro_flush_timeout`` can cause interference of the user application which is
> +attempting to busy poll by device IRQs and softirq processing. This value
> +should be chosen carefully with these tradeoffs in mind. epoll-based busy
> +polling applications may be able to mitigate how much user processing happens
> +by choosing an appropriate value for ``maxevents``.
> +
> +Users may want to consider an alternate approach, IRQ suspension, to help deal
> +with these tradeoffs.
> +
> +IRQ suspension
> +--------------
> +
> +IRQ suspension is a mechanism wherein device IRQs are masked while epoll
> +triggers NAPI packet processing.
> +
> +While application calls to epoll_wait successfully retrieve events, the kernel will
> +defer the IRQ suspension timer.
> +If the kernel does not retrieve any events
> +while busy polling (for example, because network traffic levels subsided), IRQ
> +suspension is disabled and the IRQ mitigation strategies described above are
> +engaged.
> +
> +This allows users to balance CPU consumption with network processing
> +efficiency.
> +
> +To use this mechanism:
> +
> +  1. The per-NAPI config parameter ``irq_suspend_timeout`` should be set to the
> +     maximum time (in nanoseconds) the application can have its IRQs
> +     suspended. This is done using netlink, as described above. This timeout
> +     serves as a safety mechanism to restart IRQ driver interrupt processing if
> +     the application has stalled. This value should be chosen so that it covers
> +     the amount of time the user application needs to process data from its
> +     call to epoll_wait, noting that applications can control how much data
> +     they retrieve by setting ``max_events`` when calling epoll_wait.
> +
> +  2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
> +     and ``napi_defer_hard_irqs`` can be set to low values. They will be used
> +     to defer IRQs after busy poll has found no data.
> +
> +  3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
> +     the ``EPIOCSPARAMS`` ioctl as described above.
> +
> +  4. The application uses epoll as described above to trigger NAPI packet
> +     processing.
> +
> +As mentioned above, as long as subsequent calls to epoll_wait return events to
> +userland, the ``irq_suspend_timeout`` is deferred and IRQs are disabled. This
> +allows the application to process data without interference.
> +
> +Once a call to epoll_wait results in no events being found, IRQ suspension is
> +automatically disabled and the ``gro_flush_timeout`` and
> +``napi_defer_hard_irqs`` mitigation mechanisms take over.
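
Step 3 of the list above (setting ``prefer_busy_poll`` through ``EPIOCSPARAMS``) might be easier to follow with a short C sketch next to it. The one below is only an illustration under stated assumptions: the ``demo_`` names are hypothetical, the struct mirrors the layout quoted in the patch under a different name so it cannot clash with a libc or kernel header that already provides ``struct epoll_params``, and the ioctl number is my reading of the uapi header, so it should be checked against ``linux/eventpoll.h`` before being relied on.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical mirror of the struct epoll_params layout quoted in the patch. */
struct demo_epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t prefer_busy_poll;
	uint8_t pad;	/* pads the struct to 64 bits, must stay zero */
};

_Static_assert(sizeof(struct demo_epoll_params) == 8,
	       "layout must match the 64-bit struct from the patch");

/*
 * Assumption: mirrors EPIOCSPARAMS from the kernel uapi header
 * (ioctl type 0x8A, number 0x01). Verify before use.
 */
#define DEMO_EPIOCSPARAMS _IOW(0x8A, 0x01, struct demo_epoll_params)

/* Build parameters with prefer_busy_poll enabled, as step 3 requires. */
struct demo_epoll_params demo_irq_suspend_params(uint32_t usecs, uint16_t budget)
{
	struct demo_epoll_params p;

	memset(&p, 0, sizeof(p));
	p.busy_poll_usecs = usecs;
	p.busy_poll_budget = budget;
	p.prefer_busy_poll = 1;
	return p;
}

/* Apply the parameters to an epoll fd; returns 0 or a negative errno. */
int demo_apply_params(int epfd, const struct demo_epoll_params *p)
{
	if (ioctl(epfd, DEMO_EPIOCSPARAMS, p) == -1)
		return -errno;
	return 0;
}
```

An application would call ``demo_apply_params()`` once on the descriptor returned by ``epoll_create1()``, before entering its ``epoll_wait`` loop. The usecs and budget values are left to the caller because the patch does not recommend specific numbers. On a kernel without this ioctl the call is expected to fail, and the helper returns the negative errno.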
> +
> +It is expected that ``irq_suspend_timeout`` will be set to a value much larger
> +than ``gro_flush_timeout`` as ``irq_suspend_timeout`` should suspend IRQs for
> +the duration of one userland processing cycle.
>
>  .. _threaded:
>

The rest LGTM, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@xxxxxxxxx>

-- 
An old man doll... just what I always wanted! - Clara