On 2024-08-14 15:53, Samiullah Khawaja wrote:
On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@xxxxxxxxxxxx> wrote:
On 2024-08-13 00:07, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 21:54, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 19:03, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 16:19, Stanislav Fomichev wrote:
On 08/12, Joe Damato wrote:
Greetings:
[snip]
Note that napi_suspend_irqs/napi_resume_irqs are needed even for the sake
of an individual queue or application, to make sure that IRQ suspension
is enabled/disabled right away when the state of the system changes from
busy to idle and back.
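(For readers following along, a minimal sketch of what the epoll-side
hooks amount to; the eventpoll fields and guards here are my paraphrase
of the patches, not the literal code:)

  static void ep_suspend_napi_irqs(struct eventpoll *ep)
  {
      unsigned int napi_id = READ_ONCE(ep->napi_id);

      /* only touch irq state for contexts that actually busy poll */
      if (napi_id >= MIN_NAPI_ID && READ_ONCE(ep->prefer_busy_poll))
          napi_suspend_irqs(napi_id);   /* arms irq_suspend_timeout */
  }

  static void ep_resume_napi_irqs(struct eventpoll *ep)
  {
      unsigned int napi_id = READ_ONCE(ep->napi_id);

      if (napi_id >= MIN_NAPI_ID && READ_ONCE(ep->prefer_busy_poll))
          napi_resume_irqs(napi_id);    /* hand control back to irqs */
  }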
Can we not handle everything in napi_busy_loop? If we can mark some napi
contexts as "explicitly polled by userspace with a larger defer timeout",
we should be able to do better compared to the current
NAPI_F_PREFER_BUSY_POLL, which is more like "this particular napi_poll
call is user busy polling".
Then either the application needs to be polling all the time (wasting cpu
cycles) or latencies will be determined by the timeout.
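For contrast, the current knobs are per-socket and only shape the
individual napi_poll invocations. A minimal example of what exists today
(values illustrative, assuming headers that define SO_PREFER_BUSY_POLL):

  #include <sys/socket.h>

  /* Existing interface: prefer busy polling for this socket's napi
   * context; there is no persistent "polled by userspace" mode. */
  static int enable_prefer_busy_poll(int fd)
  {
      int on = 1;
      int usecs = 100;    /* illustrative busy-poll duration */

      if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on)))
          return -1;
      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
  }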
But if I understand correctly, this means that if the application thread
that is supposed to do napi busy polling gets busy doing work on the new
data/events in userspace, napi polling will not be done until the
suspend_timeout triggers? Do you dispatch work to separate worker threads
in userspace from the thread that is doing epoll_wait?
Yes, napi polling is suspended while the application is busy between
epoll_wait calls. That's where the benefits are coming from.
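To make the usage model concrete, here is a sketch of the setup side,
using the existing EPIOCSPARAMS uapi for epoll busy polling (values
illustrative; the suspend behavior additionally assumes
irq_suspend_timeout is configured on the device):

  #include <sys/epoll.h>
  #include <sys/ioctl.h>

  /* Opt an epoll context into busy polling. Once events are delivered
   * by epoll_wait(), device irqs stay suspended until the application
   * calls epoll_wait() again, with irq_suspend_timeout as safety net. */
  static int setup_epoll_busy_poll(int epfd)
  {
      struct epoll_params params = {
          .busy_poll_usecs  = 0,   /* with this series, prefer_busy_poll
                                    * alone triggers a napi poll pass */
          .busy_poll_budget = 64,  /* illustrative */
          .prefer_busy_poll = 1,
      };

      return ioctl(epfd, EPIOCSPARAMS, &params);
  }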
The consequences depend on the nature of the application and overall
preferences for the system. If there's a "dominant" application for a
number of queues and cores, the resulting latency for other background
applications using the same queues might not be a problem at all.
One other simple mitigation is limiting the number of events that each
epoll_wait call accepts. Note that this batch size also determines the
worst-case latency for the application in question, so there is a
natural incentive to keep it limited.
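For instance (numbers purely illustrative), the bound is roughly
maxevents times the per-event processing cost:

  #include <sys/epoll.h>

  #define MAX_EVENTS 8    /* batch size doubles as the latency knob */

  /* If handling one event costs ~25 us of userspace work, irqs stay
   * suspended for at most ~8 * 25 us = 200 us between epoll_wait()
   * calls; a maxevents of 64 would allow roughly 1.6 ms. */
  static void event_loop(int epfd, void (*handle)(struct epoll_event *))
  {
      struct epoll_event evs[MAX_EVENTS];

      for (;;) {
          int n = epoll_wait(epfd, evs, MAX_EVENTS, -1);

          for (int i = 0; i < n; i++)
              handle(&evs[i]);    /* application work, irqs stay off */
      }
  }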
A more complex application design, like you suggest, might also be an
option.
Only when switching back and forth between polling and interrupts is it
possible to get low latencies across a large spectrum of offered loads
without burning cpu cycles at 100%.
Ah, I see what you're saying, yes, you're right. In this case ignore my comment
about ep_suspend_napi_irqs/napi_resume_irqs.
Thanks for probing and double-checking everything! Feedback is important
for us to properly document our proposal.
Let's see how other people feel about per-dev irq_suspend_timeout. Properly
disabling napi during busy polling is super useful, but it would still
be nice to plumb irq_suspend_timeout via epoll context or have it set on
a per-napi basis imho.
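(Purely to illustrate that idea, one possible shape for the
epoll-context plumbing, mirroring the existing epoll_params layout; the
last field is invented for illustration and appears in no posted patch:)

  #include <stdint.h>

  struct epoll_params_ext {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t  prefer_busy_poll;
      uint8_t  __pad;
      uint64_t irq_suspend_timeout;    /* hypothetical, per-context */
  };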
I agree, this would allow each napi queue to tune itself based on
heuristics. But I think doing it through an epoll-independent interface
makes more sense, as Stan suggested earlier.
The question is whether to add a useful mechanism now (one sysfs
parameter and a few lines of code) that is optional, but delivers
demonstrable and significant performance/efficiency improvements for an
important class of applications, or to wait for an uncertain future?
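To make "one sysfs parameter" concrete, enabling the mechanism would
look roughly like this from userspace; the name and units of the new
knob are assumptions by analogy with the existing per-device napi
parameters next to it:

  #include <stdio.h>

  static int set_knob(const char *dev, const char *knob, const char *val)
  {
      char path[256];
      FILE *f;

      snprintf(path, sizeof(path), "/sys/class/net/%s/%s", dev, knob);
      f = fopen(path, "w");
      if (!f)
          return -1;
      fputs(val, f);
      return fclose(f);
  }

  int main(void)
  {
      /* values illustrative; the first two knobs already exist */
      set_knob("eth0", "napi_defer_hard_irqs", "2");
      set_knob("eth0", "gro_flush_timeout", "200000");      /* ns */
      set_knob("eth0", "irq_suspend_timeout", "20000000");  /* assumed name */
      return 0;
  }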
Note that adding our mechanism in no way precludes switching the control
parameters from per-device to per-napi as Joe alluded to earlier. In
fact, it increases the incentive for doing so.
After working on this for quite a while, I am skeptical that anything
fundamentally different could be done without re-architecting the entire
napi control flow.
Thanks,
Martin