Re: [PATCH net-next v2 6/6] docs: networking: Describe irq suspension


On Mon, Oct 21, 2024 at 01:53:01AM +0000, Joe Damato wrote:
> diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
> index dfa5d549be9c..3b43477a52ce 100644
> --- a/Documentation/networking/napi.rst
> +++ b/Documentation/networking/napi.rst
> @@ -192,6 +192,28 @@ is reused to control the delay of the timer, while
>  ``napi_defer_hard_irqs`` controls the number of consecutive empty polls
>  before NAPI gives up and goes back to using hardware IRQs.
>  
> +The above parameters can also be set on a per-NAPI basis using netlink via
> +netdev-genl. This can be done programmatically in a user application or by
> +using a script included in the kernel source tree: ``tools/net/ynl/cli.py``.
> +
> +For example, using the script:
> +
> +.. code-block:: bash
> +
> +  $ kernel-source/tools/net/ynl/cli.py \
> +            --spec Documentation/netlink/specs/netdev.yaml \
> +            --do napi-set \
> +            --json='{"id": 345,
> +                     "defer-hard-irqs": 111,
> +                     "gro-flush-timeout": 11111}'
> +
> +Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
> +via netdev-genl. There is no global sysfs parameter for this value.

In the JSON, the gro-flush-timeout and irq-suspend-timeout parameter
names are written with hyphens, but the rest of the docs use underscores
(that is, gro_flush_timeout and irq_suspend_timeout), right?

> +
> +``irq_suspend_timeout`` is used to determine how long an application can
> +completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
> +which can be set on a per-epoll context basis with ``EPIOCSPARAMS`` ioctl.
> +
>  .. _poll:
>  
>  Busy polling
> @@ -207,6 +229,46 @@ selected sockets or using the global ``net.core.busy_poll`` and
>  ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
>  also exists.
>  
> +epoll-based busy polling
> +------------------------
> +
> +It is possible to trigger packet processing directly from calls to
> +``epoll_wait``. In order to use this feature, a user application must ensure
> +all file descriptors which are added to an epoll context have the same NAPI ID.
> +
> +If the application uses a dedicated acceptor thread, the application can obtain
> +the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
> +distribute that file descriptor to a worker thread. The worker thread would add
> +the file descriptor to its epoll context. This would ensure each worker thread
> +has an epoll context with FDs that have the same NAPI ID.
> +
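
For readers following along, the SO_INCOMING_NAPI_ID lookup described in the
paragraph above could look roughly like the sketch below. The helper name
``incoming_napi_id`` is made up for illustration, and the fallback value 56 is
my assumption based on the asm-generic socket header (some architectures use a
different number), so prefer the definition from the system headers when it
is available.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Assumption: asm-generic value; a few architectures define it differently. */
#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56
#endif

/*
 * Hypothetical helper: fetch the NAPI ID that delivered the most recent
 * packet for this socket. Returns 0 on success and -1 on error (errno is
 * set by getsockopt).
 */
static int incoming_napi_id(int fd, unsigned int *napi_id)
{
    socklen_t len = sizeof(*napi_id);

    if (napi_id == NULL)
        return -1;

    return getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, napi_id, &len);
}
```

An acceptor thread would call this on each accepted file descriptor and hand
the descriptor to whichever worker owns that NAPI ID.
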
> +Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program can be
> +inserted to distribute incoming connections to threads such that each thread is
> +only given incoming connections with the same NAPI ID. Care must be taken to
> +handle cases where a system may have multiple NICs.
> +
> +In order to enable busy polling, there are two choices:
> +
> +1. ``/proc/sys/net/core/busy_poll`` can be set to a time in microseconds to busy
> +   loop waiting for events. This is a system-wide setting and will cause all
> +   epoll-based applications to busy poll when they call epoll_wait. This may
> +   not be desirable as many applications may not have the need to busy poll.
> +
> +2. Applications using recent kernels can issue an ioctl on the epoll context
> +   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
> +   epoll_params``, which user programs can define as follows:
> +
> +.. code-block:: c
> +
> +  struct epoll_params {
> +      uint32_t busy_poll_usecs;
> +      uint16_t busy_poll_budget;
> +      uint8_t prefer_busy_poll;
> +
> +      /* pad the struct to a multiple of 64bits */
> +      uint8_t __pad;
> +  };
> +
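
As a reader aid, here is a minimal sketch of how an application might fill in
the struct above and issue the set ioctl. The helper name
``set_epoll_busy_poll`` is hypothetical. The fallback definitions are only
used when the toolchain headers do not already provide ``EPIOCSPARAMS``, and
they reflect my reading of the uapi header rather than anything stated in this
patch. The struct is zeroed first because, as far as I understand, the kernel
expects the padding byte to be zero.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/* Fallback for headers that predate the epoll busy poll ioctls. */
#ifndef EPIOCSPARAMS
struct epoll_params {
    uint32_t busy_poll_usecs;
    uint16_t busy_poll_budget;
    uint8_t prefer_busy_poll;

    /* pad the struct to a multiple of 64bits */
    uint8_t __pad;
};

#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#define EPIOCGPARAMS _IOR(EPOLL_IOC_TYPE, 0x02, struct epoll_params)
#endif

/*
 * Hypothetical helper: apply busy poll settings to one epoll context.
 * Returns 0 on success and -1 on error (errno is set by ioctl), for
 * example on kernels that do not implement the ioctl.
 */
static int set_epoll_busy_poll(int epfd, uint32_t usecs, uint16_t budget,
                               int prefer)
{
    struct epoll_params params;

    /* Zero everything, including __pad, before filling in the fields. */
    memset(&params, 0, sizeof(params));
    params.busy_poll_usecs = usecs;
    params.busy_poll_budget = budget;
    params.prefer_busy_poll = prefer ? 1 : 0;

    return ioctl(epfd, EPIOCSPARAMS, &params);
}
```

A call such as ``set_epoll_busy_poll(epfd, 200, 64, 1)`` uses purely
illustrative values, and callers should check the return value since older
kernels will reject the request.
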
>  IRQ mitigation
>  ---------------
>  
> @@ -222,12 +284,78 @@ Such applications can pledge to the kernel that they will perform a busy
>  polling operation periodically, and the driver should keep the device IRQs
>  permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
>  socket option. To avoid system misbehavior the pledge is revoked
> -if ``gro_flush_timeout`` passes without any busy poll call.
> +if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
> +busy polling applications, the ``prefer_busy_poll`` field of ``struct
> +epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
> +enable this mode. See the above section for more details.
>  
>  The NAPI budget for busy polling is lower than the default (which makes
>  sense given the low latency intention of normal busy polling). This is
>  not the case with IRQ mitigation, however, so the budget can be adjusted
> -with the ``SO_BUSY_POLL_BUDGET`` socket option.
> +with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
> +applications, the ``busy_poll_budget`` field can be adjusted to the desired value
> +in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
> +ioctl. See the above section for more details.
> +
> +It is important to note that choosing a large value for ``gro_flush_timeout``
> +will defer IRQs to allow for better batch processing, but will induce latency
> +when the system is not fully loaded. Choosing a small value for
> +``gro_flush_timeout`` can cause device IRQs and softirq processing to
> +interfere with the user application which is attempting to busy poll. This
> +value should be chosen carefully with these tradeoffs in mind. epoll-based busy
> +polling applications may be able to mitigate how much user processing happens
> +by choosing an appropriate value for ``maxevents``.
> +
> +Users may want to consider an alternate approach, IRQ suspension, to help deal
> +with these tradeoffs.
> +
> +IRQ suspension
> +--------------
> +
> +IRQ suspension is a mechanism wherein device IRQs are masked while epoll
> +triggers NAPI packet processing.
> +
> +While application calls to epoll_wait successfully retrieve events, the kernel will
> +defer the IRQ suspension timer. If the kernel does not retrieve any events
> +while busy polling (for example, because network traffic levels subsided), IRQ
> +suspension is disabled and the IRQ mitigation strategies described above are
> +engaged.
> +
> +This allows users to balance CPU consumption with network processing
> +efficiency.
> +
> +To use this mechanism:
> +
> +  1. The per-NAPI config parameter ``irq_suspend_timeout`` should be set to the
> +     maximum time (in nanoseconds) the application can have its IRQs
> +     suspended. This is done using netlink, as described above. This timeout
> +     serves as a safety mechanism to restart IRQ driver interrupt processing if
> +     the application has stalled. This value should be chosen so that it covers
> +     the amount of time the user application needs to process data from its
> +     call to epoll_wait, noting that applications can control how much data
> +     they retrieve by setting ``maxevents`` when calling epoll_wait.
> +
> +  2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
> +     and ``napi_defer_hard_irqs`` can be set to low values. They will be used
> +     to defer IRQs after busy poll has found no data.
> +
> +  3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
> +     the ``EPIOCSPARAMS`` ioctl as described above.
> +
> +  4. The application uses epoll as described above to trigger NAPI packet
> +     processing.
> +
> +As mentioned above, as long as subsequent calls to epoll_wait return events to
> +userland, the ``irq_suspend_timeout`` is deferred and IRQs are disabled. This
> +allows the application to process data without interference.
> +
> +Once a call to epoll_wait results in no events being found, IRQ suspension is
> +automatically disabled and the ``gro_flush_timeout`` and
> +``napi_defer_hard_irqs`` mitigation mechanisms take over.
> +
> +It is expected that ``irq_suspend_timeout`` will be set to a value much larger
> +than ``gro_flush_timeout`` as ``irq_suspend_timeout`` should suspend IRQs for
> +the duration of one userland processing cycle.
>  
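
To make the relationship between ``maxevents`` and one userland processing
cycle concrete, here is a rough sketch of a single worker iteration. The names
``poll_once``, ``event_handler`` and ``MAX_BATCH`` are invented for this
example, and the batch limit of 64 is arbitrary. The idea is only that a
smaller ``maxevents`` bounds how much work happens between two calls to
epoll_wait, which in turn bounds how long IRQs need to stay suspended.

```c
#include <stddef.h>
#include <sys/epoll.h>

/* Arbitrary upper bound on events handled per cycle (assumption). */
#define MAX_BATCH 64

/* Hypothetical per-event callback supplied by the application. */
typedef void (*event_handler)(const struct epoll_event *ev);

/*
 * Run one processing cycle: wait for up to maxevents events, hand each
 * one to the callback (if any), and return the epoll_wait result. A
 * return of 0 means no events were found, which is the point where the
 * text above says IRQ suspension ends.
 */
static int poll_once(int epfd, int maxevents, int timeout_ms,
                     event_handler handle)
{
    struct epoll_event events[MAX_BATCH];
    int n, i;

    if (maxevents <= 0 || maxevents > MAX_BATCH)
        return -1;

    n = epoll_wait(epfd, events, maxevents, timeout_ms);
    for (i = 0; i < n; i++) {
        if (handle != NULL)
            handle(&events[i]);
    }

    return n;
}
```

A worker thread would call ``poll_once`` in a loop, and the time one iteration
takes is roughly what ``irq_suspend_timeout`` has to cover.
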
>  .. _threaded:
>  

The rest LGTM, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@xxxxxxxxx>

-- 
An old man doll... just what I always wanted! - Clara
