On 2024-08-14 15:53, Samiullah Khawaja wrote:
On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@xxxxxxxxxxxx> wrote:
On 2024-08-13 00:07, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 21:54, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 19:03, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 16:19, Stanislav Fomichev wrote:
On 08/12, Joe Damato wrote:
Greetings:
[snip]
Note that napi_suspend_irqs/napi_resume_irqs are needed even for the sake
of an individual queue or application, to make sure that IRQ suspension
is enabled/disabled right away when the state of the system changes from
busy to idle and back.
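(For readers following along, a minimal sketch of what the epoll-side
hooks amount to; the eventpoll fields and guards here are my paraphrase
of the patches, not the literal code:)

  static void ep_suspend_napi_irqs(struct eventpoll *ep)
  {
      unsigned int napi_id = READ_ONCE(ep->napi_id);

      /* only touch irq state for contexts that actually busy poll */
      if (napi_id >= MIN_NAPI_ID && READ_ONCE(ep->prefer_busy_poll))
          napi_suspend_irqs(napi_id);   /* arms irq_suspend_timeout */
  }

  static void ep_resume_napi_irqs(struct eventpoll *ep)
  {
      unsigned int napi_id = READ_ONCE(ep->napi_id);

      if (napi_id >= MIN_NAPI_ID && READ_ONCE(ep->prefer_busy_poll))
          napi_resume_irqs(napi_id);    /* hand control back to irqs */
  }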
Can we not handle everything in napi_busy_loop? If we can mark some napi
contexts as "explicitly polled by userspace with a larger defer timeout",
we should be able to do better compared to the current
NAPI_F_PREFER_BUSY_POLL, which is more like "this particular napi_poll
call is user busy polling".
Then either the application needs to be polling all the time (wasting cpu
cycles) or latencies will be determined by the timeout.
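For contrast, the current knobs are per-socket and only shape the
individual napi_poll invocations. A minimal example of what exists today
(values illustrative, assuming headers that define SO_PREFER_BUSY_POLL):

  #include <sys/socket.h>

  /* Existing interface: prefer busy polling for this socket's napi
   * context; there is no persistent "polled by userspace" mode. */
  static int enable_prefer_busy_poll(int fd)
  {
      int on = 1;
      int usecs = 100;    /* illustrative busy-poll duration */

      if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on)))
          return -1;
      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
  }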
But if I understand correctly, this means that if the application thread
that is supposed to do napi busy polling gets busy doing work on the new
data/events in userspace, napi polling will not be done until the
suspend_timeout triggers? Do you dispatch work to separate worker threads
in userspace from the thread that is doing epoll_wait?
Yes, napi polling is suspended while the application is busy between
epoll_wait calls. That's where the benefits are coming from.
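To make the usage model concrete, here is a sketch of the setup side,
using the existing EPIOCSPARAMS uapi for epoll busy polling (values
illustrative; the suspend behavior additionally assumes
irq_suspend_timeout is configured on the device):

  #include <sys/epoll.h>
  #include <sys/ioctl.h>

  /* Opt an epoll context into busy polling. Once events are delivered
   * by epoll_wait(), device irqs stay suspended until the application
   * calls epoll_wait() again, with irq_suspend_timeout as safety net. */
  static int setup_epoll_busy_poll(int epfd)
  {
      struct epoll_params params = {
          .busy_poll_usecs  = 0,   /* with this series, prefer_busy_poll
                                    * alone triggers a napi poll pass */
          .busy_poll_budget = 64,  /* illustrative */
          .prefer_busy_poll = 1,
      };

      return ioctl(epfd, EPIOCSPARAMS, &params);
  }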
The consequences depend on the nature of the application and overall
preferences for the system. If there's a "dominant" application for a
number of queues and cores, the resulting latency for other background
applications using the same queues might not be a problem at all.
One other simple mitigation is limiting the number of events that each
epoll_wait call accepts. Note that this batch size also determines the
worst-case latency for the application in question, so there is a
natural incentive to keep it limited.
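For instance (numbers purely illustrative), the bound is roughly
maxevents times the per-event processing cost:

  #include <sys/epoll.h>

  #define MAX_EVENTS 8    /* batch size doubles as the latency knob */

  /* If handling one event costs ~25 us of userspace work, irqs stay
   * suspended for at most ~8 * 25 us = 200 us between epoll_wait()
   * calls; a maxevents of 64 would allow roughly 1.6 ms. */
  static void event_loop(int epfd, void (*handle)(struct epoll_event *))
  {
      struct epoll_event evs[MAX_EVENTS];

      for (;;) {
          int n = epoll_wait(epfd, evs, MAX_EVENTS, -1);

          for (int i = 0; i < n; i++)
              handle(&evs[i]);    /* application work, irqs stay off */
      }
  }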
A more complex application design, like you suggest, might also be an
option.
Only when switching back and forth between polling and interrupts is it
possible to get low latencies across a large spectrum of offered loads
without burning cpu cycles at 100%.
Ah, I see what you're saying, yes, you're right. In this case ignore my comment
about ep_suspend_napi_irqs/napi_resume_irqs.
Thanks for probing and double-checking everything! Feedback is important
for us to properly document our proposal.
Let's see how other people feel about per-dev irq_suspend_timeout. Properly
disabling napi during busy polling is super useful, but it would still
be nice to plumb irq_suspend_timeout via epoll context or have it set on
a per-napi basis imho.
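(Purely to illustrate that idea, one possible shape for the
epoll-context plumbing, mirroring the existing epoll_params layout; the
last field is invented for illustration and appears in no posted patch:)

  #include <stdint.h>

  struct epoll_params_ext {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t  prefer_busy_poll;
      uint8_t  __pad;
      uint64_t irq_suspend_timeout;    /* hypothetical, per-context */
  };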
I agree, this would allow each napi queue to tune itself based on
heuristics. But I think doing it through an epoll-independent interface
makes more sense, as Stan suggested earlier.
The question is whether to add a useful mechanism now (one sysfs
parameter and a few lines of code) that is optional, but delivers
demonstrable and significant performance/efficiency improvements for an
important class of applications, or to wait for an uncertain future?
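To make "one sysfs parameter" concrete, enabling the mechanism would
look roughly like this from userspace; the name and units of the new
knob are assumptions by analogy with the existing per-device napi
parameters next to it:

  #include <stdio.h>

  static int set_knob(const char *dev, const char *knob, const char *val)
  {
      char path[256];
      FILE *f;

      snprintf(path, sizeof(path), "/sys/class/net/%s/%s", dev, knob);
      f = fopen(path, "w");
      if (!f)
          return -1;
      fputs(val, f);
      return fclose(f);
  }

  int main(void)
  {
      /* values illustrative; the first two knobs already exist */
      set_knob("eth0", "napi_defer_hard_irqs", "2");
      set_knob("eth0", "gro_flush_timeout", "200000");      /* ns */
      set_knob("eth0", "irq_suspend_timeout", "20000000");  /* assumed name */
      return 0;
  }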
Note that adding our mechanism in no way precludes switching the control
parameters from per-device to per-napi as Joe alluded to earlier. In
fact, it increases the incentive for doing so.
After working on this for quite a while, I am skeptical that anything
fundamentally different could be done without re-architecting the entire
napi control flow.
Thanks,
Martin