On 10/21/24 03:52, Joe Damato wrote:
> Greetings:
>
> Welcome to v2, see changelog below.
>
> This series introduces a new mechanism, IRQ suspension, which allows
> network applications using epoll to mask IRQs during periods of high
> traffic while also reducing tail latency (compared to existing
> mechanisms, see below) during periods of low traffic. In doing so, this
> balances CPU consumption with network processing efficiency.
>
> Martin Karsten (CC'd) and I have been collaborating on this series for
> several months and have appreciated the feedback from the community on
> our RFC [1]. We've updated the cover letter and kernel documentation in
> an attempt to more clearly explain how this mechanism works, how
> applications can use it, and how it compares to existing mechanisms in
> the kernel. We've added an additional test case, 'fullbusy', achieved
> by modifying libevent for comparison. See below for a detailed
> description, a link to the patch, and test results.
>
> I briefly mentioned this idea at netdev conf 2024 (for those who were
> there) and Martin described this idea in an earlier paper presented at
> Sigmetrics 2024 [2].
>
> ~ The short explanation (TL;DR)
>
> We propose adding a new napi config parameter, irq_suspend_timeout, to
> help balance CPU usage and network processing efficiency when using
> IRQ deferral and napi busy poll.
>
> If this parameter is set to a non-zero value *and* a user application
> has enabled preferred busy poll on a busy poll context (via the
> EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
> epoll ioctl for epoll_params"); a minimal usage sketch appears below),
> then application calls to epoll_wait for that context will cause
> device IRQs and softirq processing to be suspended as long as
> epoll_wait successfully retrieves data from the NAPI. Each time data
> is retrieved, the irq_suspend_timeout is deferred.
>
> If/when network traffic subsides and epoll_wait returns no data, IRQ
> suspension is immediately reverted to the existing
> napi_defer_hard_irqs and gro_flush_timeout mechanism, which was
> introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
> feature").
>
> The irq_suspend_timeout serves as a safety mechanism. If userland takes
> a long time processing data, irq_suspend_timeout will fire and restart
> normal NAPI processing.
>
> For a more in-depth explanation, please continue reading.
>
> ~ Comparison with existing mechanisms
>
> Interrupt mitigation can be accomplished in napi software, by setting
> napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
> in the NIC. This can be quite efficient, but in both cases a fixed
> timeout (or packet count) needs to be configured. However, a fixed
> timeout cannot effectively support both low- and high-load situations:
>
> At low load, an application typically processes a few requests and then
> waits to receive more input data. In this scenario, a large timeout will
> cause unnecessary latency.
>
> At high load, an application typically processes many requests before
> being ready to receive more input data. In this case, a small timeout
> will likely fire prematurely and trigger irq/softirq processing, which
> interferes with the application's execution. This causes overhead, most
> likely due to cache contention.
>
> While NICs attempt to provide adaptive interrupt coalescing schemes,
> these cannot properly take into account application-level processing.
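>
> To illustrate the setup referenced in the TL;DR above, the following
> sketch enables preferred busy poll on an epoll context. It is
> illustrative only and not part of the series; struct epoll_params and
> the ioctl number are copied from <linux/eventpoll.h> (as of commit
> 18e2bf0edf4d) so the sketch builds even where userspace headers do not
> yet provide them:
>
>     /* Sketch: enable preferred busy poll on an epoll context. */
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <sys/epoll.h>
>     #include <sys/ioctl.h>
>
>     /* As in <linux/eventpoll.h> since commit 18e2bf0edf4d; drop this
>      * block if your headers already provide it. */
>     struct epoll_params {
>         uint32_t busy_poll_usecs;
>         uint16_t busy_poll_budget;
>         uint8_t prefer_busy_poll;
>         uint8_t __pad;  /* pad to a multiple of 64 bits */
>     };
>     #define EPOLL_IOC_TYPE 0x8A
>     #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
>
>     int main(void)
>     {
>         struct epoll_params params = {
>             /* 0 + prefer_busy_poll: exactly one napi busy loop
>              * iteration per epoll_wait call (see patch 4) */
>             .busy_poll_usecs = 0,
>             .busy_poll_budget = 64,
>             .prefer_busy_poll = 1,
>         };
>         int epfd = epoll_create1(0);
>
>         if (epfd < 0 || ioctl(epfd, EPIOCSPARAMS, &params) < 0) {
>             perror("epoll busy poll setup");
>             return 1;
>         }
>
>         /* ... add sockets with epoll_ctl() and call epoll_wait() as
>          * usual; suspension itself also requires the netlink
>          * configuration shown further below ... */
>         return 0;
>     }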
>
> An alternative packet delivery mechanism is busy-polling, which results
> in perfect alignment of application processing and network polling. It
> delivers optimal performance (throughput and latency), but results in
> 100% cpu utilization and is thus inefficient for below-capacity
> workloads.
>
> We propose to add a new packet delivery mode that properly alternates
> between busy polling and interrupt-based delivery depending on busy and
> idle periods of the application. During a busy period, the system
> operates in busy-polling mode, which avoids interference. During an idle
> period, the system falls back to interrupt deferral, but with a small
> timeout to avoid excessive latencies. This delivery mode can also be
> viewed as an extension of basic interrupt deferral, but alternating
> between a small and a very large timeout.
>
> This delivery mode is efficient, because it avoids softirq execution
> interfering with application processing during busy periods. It can be
> used with blocking epoll_wait to conserve cpu cycles during idle
> periods. The effect of alternating between busy and idle periods is that
> performance (throughput and latency) is very close to full busy polling,
> while cpu utilization is lower and very close to interrupt mitigation.
>
> ~ Usage details
>
> IRQ suspension is introduced via a per-NAPI configuration parameter that
> controls the maximum time that IRQs can be suspended.
>
> Here's how it is intended to work:
> - The user application (or system administrator) uses the netdev-genl
>   netlink interface to set the pre-existing napi_defer_hard_irqs and
>   gro_flush_timeout NAPI config parameters to enable IRQ deferral.
>
> - The user application (or system administrator) sets the proposed
>   irq_suspend_timeout parameter via the netdev-genl netlink interface
>   to a larger value than gro_flush_timeout to enable IRQ suspension
>   (a configuration sketch appears below).
>
> - The user application issues the existing epoll ioctl to set the
>   prefer_busy_poll flag on the epoll context (as sketched above).
>
> - The user application then calls epoll_wait to busy poll for network
>   events, as it normally would.
>
> - If epoll_wait returns events to userland, IRQs are suspended for the
>   duration of irq_suspend_timeout.
>
> - If epoll_wait finds no events and the thread is about to go to
>   sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
>   is resumed.
>
> As long as epoll_wait is retrieving events, IRQs (and softirq
> processing) for the NAPI being polled remain disabled. When network
> traffic subsides, eventually a busy poll loop in the kernel will
> retrieve no data. When this occurs, regular IRQ deferral using
> gro_flush_timeout for the polled NAPI is re-enabled.
>
> Unless IRQ suspension is extended by subsequent calls to epoll_wait,
> it automatically ends when the irq_suspend_timeout timer expires.
> Regular deferral is also immediately re-enabled when the epoll context
> is destroyed.
>
> ~ Usage scenario
>
> The target scenario for IRQ suspension as a packet delivery mode is a
> system that runs a dominant application with substantial network I/O.
> The target application can be configured to receive input data up to a
> certain batch size (via the epoll_wait maxevents parameter) and this
> batch size determines the worst-case latency that application requests
> might experience. Because packet delivery is suspended during the
> target application's processing, the batch size also determines the
> worst-case latency of concurrent applications using the same RX
> queue(s).
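>
> To show where the batch size enters, a hypothetical receive loop is
> sketched below. BATCH mirrors the epoll_wait maxevents parameter;
> process_request() is an illustrative placeholder for application work,
> not part of this series, and error handling is elided:
>
>     #include <sys/epoll.h>
>
>     #define BATCH 64  /* maxevents: bounds worst-case latency */
>
>     extern void process_request(struct epoll_event *ev);
>
>     static void event_loop(int epfd)
>     {
>         struct epoll_event evs[BATCH];
>         int i, n;
>
>         for (;;) {
>             /* While events keep being retrieved, device IRQs stay
>              * suspended and irq_suspend_timeout is re-armed; once no
>              * events are found and the thread is about to sleep, the
>              * kernel reverts to the napi_defer_hard_irqs and
>              * gro_flush_timeout mechanism. */
>             n = epoll_wait(epfd, evs, BATCH, -1);
>
>             for (i = 0; i < n; i++)
>                 process_request(&evs[i]);
>         }
>     }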
>
> gro_flush_timeout should be set as small as possible, but large enough
> to make sure that a single request is likely not interfered with.
>
> irq_suspend_timeout is largely a safety mechanism against misbehaving
> applications. It should be set large enough to cover the processing of
> an entire application batch, i.e., the ratio of irq_suspend_timeout to
> gro_flush_timeout should roughly correspond to the maximum batch size
> that the target application would process in one go. For example (with
> purely illustrative numbers): a gro_flush_timeout of 20,000 ns and a
> maximum batch of 64 requests suggest an irq_suspend_timeout of at
> least 64 * 20,000 ns = 1,280,000 ns.
>
> ~ Design rationale
>
> The implementation of the IRQ suspension mechanism very nicely dovetails
> with the existing mechanism for IRQ deferral when preferred busy poll is
> enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
> busy-polling"), see that commit message for more details).
>
> While it would be possible to inject the suspend timeout via the
> existing epoll ioctl, it is more natural to avoid this path for one
> main reason:
>
> An epoll context is linked to NAPI IDs as file descriptors are added;
> this means any epoll context might suddenly be associated with a
> different net_device if the application were to replace all existing
> fds with fds from a different device. In this case, the scope of the
> suspend timeout becomes unclear and many edge cases for both the user
> application and the kernel are introduced.
>
> Only a single iteration through napi busy polling is needed for this
> mechanism to work effectively. Since an important objective for this
> mechanism is preserving cpu cycles, exactly one iteration of the napi
> busy loop is invoked when busy_poll_usecs is set to 0.
>
> ~ Important call outs in the implementation
>
> - Enabling per epoll-context preferred busy poll will now effectively
>   lead to a nonblocking iteration through napi_busy_loop, even when
>   busy_poll_usecs is 0. See patch 4.
>
> - Patches apply cleanly on commit 160a810b2a85 ("net: vxlan: update
>   the document for vxlan_snoop()").
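>
> For completeness, here is a sketch of the netlink configuration step
> from the usage details above, using libnl-genl-3. The NETDEV_* command
> and attribute names follow this series and its netdev-genl
> prerequisites as we understand them; the final names in
> <linux/netdev.h> may differ, and most error handling is elided:
>
>     /* Sketch: set per-NAPI parameters via the netdev-genl family.
>      * Build: cc napi_set.c $(pkg-config --cflags --libs libnl-genl-3.0)
>      */
>     #include <stdint.h>
>     #include <netlink/netlink.h>
>     #include <netlink/genl/genl.h>
>     #include <netlink/genl/ctrl.h>
>     #include <linux/netdev.h>
>
>     static int napi_set(uint32_t napi_id)
>     {
>         struct nl_sock *sk = nl_socket_alloc();
>         struct nl_msg *msg;
>         int family, err;
>
>         genl_connect(sk);
>         family = genl_ctrl_resolve(sk, "netdev");
>
>         msg = nlmsg_alloc();
>         genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family,
>                     0, 0, NETDEV_CMD_NAPI_SET, 0);
>
>         /* napi_id can be obtained with the SO_INCOMING_NAPI_ID
>          * getsockopt or a netdev-genl napi-get dump; the timeouts
>          * below are in nanoseconds. */
>         nla_put_u32(msg, NETDEV_A_NAPI_ID, napi_id);
>         nla_put_u32(msg, NETDEV_A_NAPI_DEFER_HARD_IRQS, 100);
>         nla_put_u64(msg, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT, 20000);
>         nla_put_u64(msg, NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT, 20000000);
>
>         err = nl_send_auto(sk, msg);
>         if (err >= 0)
>             err = nl_wait_for_ack(sk);
>
>         nlmsg_free(msg);
>         nl_socket_free(sk);
>         return err < 0 ? err : 0;
>     }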
>
> Multiple scenarios were benchmarked as described below and the scripts
> used for producing these results can be found on github [7] (note: all
> scenarios use NAPI-based traffic splitting via SO_INCOMING_NAPI_ID by
> passing -N to memcached):
>
> - base:
>   - no other options enabled
> - deferX:
>   - set defer_hard_irqs to 100
>   - set gro_flush_timeout to X,000
> - napibusy:
>   - set defer_hard_irqs to 100
>   - set gro_flush_timeout to 200,000
>   - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
>     busy_poll_budget = 64, prefer_busy_poll = true)
> - fullbusy:
>   - set defer_hard_irqs to 100
>   - set gro_flush_timeout to 5,000,000
>   - enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
>     busy_poll_budget = 64, prefer_busy_poll = true)
>   - change memcached's nonblocking epoll_wait invocation (via
>     libevent) to use a 1 ms timeout
> - suspendX:
>   - set defer_hard_irqs to 100
>   - set gro_flush_timeout to X,000
>   - set irq_suspend_timeout to 20,000,000
>   - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>     busy_poll_budget = 64, prefer_busy_poll = true)
>
> ~ Benchmark results
>
> Tested on:
>
> Single socket AMD EPYC 7662 64-Core Processor
> Hyperthreading disabled
> 4 NUMA Zones (NPS=4)
> 16 CPUs per NUMA zone (64 cores total)
> 2 x Dual port 100Gbps Mellanox Technologies ConnectX-5 Ex EN NIC
>
> The test machine is configured such that a single interface has 8 RX
> queues. The queues' IRQs and memcached are pinned to CPUs that are
> NUMA-local to the interface which is under test. The NIC's interrupt
> coalescing configuration is left at boot-time defaults.
>
> Results:
>
> Results are shown below. The mechanism added by this series is
> represented by the 'suspend' cases. Data presented shows a summary over
> at least 10 runs of each test case [8] using the scripts on github [7].
> For latency, the median is shown. For throughput and CPU utilization,
> the average is shown.
>
> The results also include cycles-per-query (cpq) and
> instructions-per-query (ipq) metrics, following the methodology
> proposed in [2], to augment the CPU utilization numbers, which could
> be skewed due to frequency scaling. We find that frequency scaling
> does not appear to skew the results, as CPU utilization and the
> low-level metrics show similar trends.
>
> These results were captured using the scripts on github [7] to
> illustrate how this approach compares with other pre-existing
> mechanisms. This data is not to be interpreted as scientific data
> captured in a fully isolated lab setting, but instead as best-effort,
> illustrative information comparing and contrasting tradeoffs.
>
> The absolute QPS results are higher than in our previous submission,
> but the relative differences between variants are equivalent. Because
> the patches have been rebased on 6.12, several factors have likely
> influenced the overall performance. Most importantly, we had to switch
> to a new set of basic kernel options, which has likely altered the
> baseline performance. Because the overall comparison of variants still
> holds, we have not attempted to recreate the exact set of kernel
> options from the previous submission.
>
> Compare:
> - Throughput (MAX) and latencies of base vs suspend.
> - CPU usage of napibusy and fullbusy during lower load (200K, 400K for
>   example) vs suspend.
> - Latency of the defer variants vs suspend as timeout and load
>   increase.
>
> The overall takeaway is that the suspend variants provide a superior
> combination of high throughput, low latency, and low cpu utilization
> compared to all other variants. Each of the suspend variants works very
> well, but some fine-tuning between latency and cpu utilization is still
> possible by tuning the small timeout (gro_flush_timeout).
>
> Note: we've reorganized the results to make comparison among testcases
> with the same load easier.
>
> testcase   load  qps     avglat 95%lat 99%lat cpu cpq   ipq
> base       200K  200024  127    254    458    25  12748 11289
> defer10    200K  199991  64     128    166    27  18763 16574
> defer20    200K  199986  72     135    178    25  15405 14173
> defer50    200K  200025  91     149    198    23  12275 12203
> defer200   200K  199996  182    266    326    18  8595  9183
> fullbusy   200K  200040  58     123    167    100 43641 23145
> napibusy   200K  200009  115    244    299    56  24797 24693
> suspend10  200K  200005  63     128    167    32  19559 17240
> suspend20  200K  199952  69     132    170    29  16324 14838
> suspend50  200K  200019  84     144    189    26  13106 12516
> suspend200 200K  199978  168    264    326    20  9331  9643
>
> testcase   load  qps     avglat 95%lat 99%lat cpu cpq   ipq
> base       400K  400017  157    292    762    39  9287  9325
> defer10    400K  400033  71     141    204    53  13950 12943
> defer20    400K  399935  79     150    212    47  12027 11673
> defer50    400K  399888  101    171    231    39  9556  9921
> defer200   400K  399993  200    287    358    32  7428  8576
> fullbusy   400K  400018  63     132    203    100 21827 16062
> napibusy   400K  399970  89     230    292    83  18156 16508
> suspend10  400K  400061  69     139    202    54  13576 13057
> suspend20  400K  399988  73     144    206    49  11930 11773
> suspend50  400K  399975  88     161    218    42  9996  10270
> suspend200 400K  399954  172    276    353    34  7847  8713
>
> testcase   load  qps     avglat 95%lat 99%lat cpu cpq   ipq
> base       600K  600031  166    289    631    61  9188  8787
> defer10    600K  599967  85     167    262    75  11833 10947
> defer20    600K  599888  89     165    243    66  10513 10362
> defer50    600K  600072  109    185    253    55  8664  9190
> defer200   600K  599951  222    315    393    45  6892  8213
> fullbusy   600K  600041  69     145    227    100 14549 13936
> napibusy   600K  599980  79     188    280    96  13927 14155
> suspend10  600K  600028  78     159    267    69  10877 11032
> suspend20  600K  600026  81     159    254    64  9922  10320
> suspend50  600K  600007  96     178    258    57  8681  9331
> suspend200 600K  599964  177    295    369    47  7115  8366
>
> testcase   load  qps     avglat 95%lat 99%lat cpu cpq   ipq
> base       800K  800034  198    329    698    84  9366  8338
> defer10    800K  799718  243    642    1457   95  10532 9007
> defer20    800K  800009  132    245    399    89  9956  8979
> defer50    800K  800024  136    228    378    80  9002  8598
> defer200   800K  799965  255    362    473    66  7481  8147
> fullbusy   800K  799927  78     157    253    100 10915 12533
> napibusy   800K  799870  81     173    273    99  10826 12532
> suspend10  800K  799991  84     167    269    83  9380  9802
> suspend20  800K  799979  90     172    290    78  8765  9404
> suspend50  800K  800031  106    191    307    71  7945  8805
> suspend200 800K  799905  182    307    411    62  6985  8242
>
> testcase   load  qps     avglat 95%lat 99%lat cpu cpq   ipq
> base       1000K 919543  3805   6390   14229  98  9324  7978
> defer10    1000K 850751  4574   7382   15370  99  10218 8470
> defer20    1000K 890296  4736   6862   14858  99  9708  8277
> defer50    1000K 932694  3463   6180   13251  97  9148  8053
> defer200   1000K 951311  3524   6052   13599  96  8875  7845
> fullbusy   1000K 1000011 90     181    278    100 8731  10686
> napibusy   1000K 1000050 93     184    280    100 8721  10547
> suspend10  1000K 999962  101    193    306    92  8138  8980
> suspend20  1000K 1000030 103    191    324    88  7844  8763
> suspend50  1000K 1000001 114    202    320    83  7396  8431
> suspend200 1000K 999965  185    314    428    76  6733  8072
>
> testcase   load  qps     avglat 95%lat 99%lat cpu cpq   ipq
> base       MAX   1005592 4651   6594   14979  100 8679  7918
> defer10    MAX   928204  5106   7286   15199  100 9398  8380
> defer20    MAX   984663  4774   6518   14920  100 8861  8063
> defer50    MAX   1044099 4431   6368   14652  100 8350  7948
> defer200   MAX   1040451 4423   6610   14674  100 8380  7931
> fullbusy   MAX   1236608 3715   3987   12805  100 7051  7936
> napibusy   MAX   1077516 4345   10155  15957  100 8080  7842
> suspend10  MAX   1218344 3760   3990   12585  100 7150  7935
> suspend20  MAX   1220056 3752   4053   12602  100 7150  7961
> suspend50  MAX   1213666 3791   4103   12919  100 7183  7959
> suspend200 MAX   1217411 3768   3988   12863  100 7161  7954
>
> ~ FAQ
>
> - Can the new timeout value be threaded through the new epoll ioctl?
>
>   Only with difficulty. The epoll ioctl sets options on an epoll
>   context and the NAPI ID associated with an epoll context can change
>   based on what file descriptors a user app adds to the epoll context.
>   This would introduce complexity in the API from the user perspective
>   and also complexity in the kernel.
>
> - Can irq suspend be built by combining NIC coalescing and
>   gro_flush_timeout?
>
>   No. The problem is that the long timeout must engage if and only if
>   prefer-busy is active.
>
>   When using NIC coalescing for the short timeout (without
>   napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
>   period will trigger softirq, which will run napi polling. At this
>   point, prefer-busy is not active, so NIC interrupts would be
>   re-enabled. Then it is not possible for the longer timeout to
>   interject to switch control back to polling. In other words, only by
>   using the software timer for the short timeout is it possible to
>   extend the timeout without having to reprogram the NIC timer or
>   reach down directly and disable interrupts.
>
>   Using gro_flush_timeout for the long timeout also has problems, for
>   the same underlying reason. In the current napi implementation,
>   gro_flush_timeout is not tied to prefer-busy. We'd either have to
>   change that and in the process modify the existing deferral
>   mechanism, or introduce a state variable to determine whether
>   gro_flush_timeout is used as the long timeout for irq suspend or
>   whether it is used for its default purpose. In an earlier version,
>   we did try something similar to the latter and made it work, but it
>   ends up being a lot more convoluted than our current proposal.
>
> - Isn't it already possible to combine busy looping with irq deferral?
>
>   Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
>   gro_flush_timeout is a precondition for prefer_busy_poll to have an
>   effect. If the application also uses a tight busy loop with
>   essentially nonblocking epoll_wait (accomplished with a very short
>   timeout parameter), this is the fullbusy case shown in the results.
>   An application using blocking epoll_wait is shown as the napibusy
>   case in the results. It's a hybrid approach that provides limited
>   latency benefits compared to the base case and plain irq deferral,
>   but is not as good as fullbusy or suspend.
>
> ~ Special thanks
>
> Several people were involved in earlier stages of the development of
> this mechanism whom we'd like to thank:
>
> - Peter Cai (CC'd), for the initial kernel patch and his contributions
>   to the paper.
>
> - Mohammadamin Shafie (CC'd), for testing various versions of the
>   kernel patch and providing helpful feedback.
>
> Thanks,
> Martin and Joe
>
> [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@xxxxxxxxxx/
> [2]: https://doi.org/10.1145/3626780
> [3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
> [4]: https://github.com/leverich/mutilate
> [5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
> [6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/libevent.patch
> [7]: https://github.com/martinkarsten/irqsuspend
> [8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
>
> v2:
> - Cover letter updated, including a re-run of test data.
> - Patch 1 rewritten to use netdev-genl instead of sysfs.
> - Patch 3 updated with a comment added to napi_resume_irqs.
> - Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
>   Annotate data-race of busy_poll_usecs") has been picked up from VFS.
> - Patch 6 updated the kernel documentation.
>
> rfc -> v1: https://lore.kernel.org/netdev/20240823173103.94978-1-jdamato@xxxxxxxxxx/
> - Cover letter updated to include more details.
> - Patch 1 updated to remove the documentation added. This was moved to
>   patch 6 with the rest of the docs (see below).
> - Patch 5 updated to fix an error uncovered by the kernel build robot.
>   See patch 5's changelog for more details.
> - Patch 6 added, which updates kernel documentation.

The changes make sense to me, and I could not find any obvious issue in
the patches.

I think this deserves some - even basic - self-test coverage. Note that
you can enable GRO on veth devices to make NAPI instances available
there. Possibly you could opt for a drivers/net test defaulting to veth
usage and allowing the user to select real H/W via env variables.

Thanks,

Paolo