Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Martin Karsten <mkarsten@xxxxxxxxxxxx> · Sat, 17 Aug 2024 14:15:30 -0400

On 2024-08-16 16:58, Willem de Bruijn wrote:
Martin Karsten wrote:
On 2024-08-16 13:01, Willem de Bruijn wrote:
Joe Damato wrote:
On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
Willem de Bruijn wrote:
Martin Karsten wrote:
On 2024-08-14 15:53, Samiullah Khawaja wrote:
On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@xxxxxxxxxxxx> wrote:

On 2024-08-13 00:07, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 21:54, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 19:03, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 16:19, Stanislav Fomichev wrote:
On 08/12, Joe Damato wrote:
Greetings:

[snip]

Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
an individual queue or application to make sure that IRQ suspension is
enabled/disabled right away when the state of the system changes from busy
to idle and back.

Can we not handle everything in napi_busy_loop? If we can mark some napi
contexts as "explicitly polled by userspace with a larger defer timeout",
we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
which is more like "this particular napi_poll call is user busy polling".

Then either the application needs to be polling all the time (wasting cpu
cycles) or latencies will be determined by the timeout.

But if I understand correctly, this means that if the application thread
that is supposed to do napi busy polling gets busy doing work on the new
data/events in userspace, napi polling will not be done until the
suspend_timeout triggers? Do you dispatch work to separate worker threads,
in userspace, from the thread that is doing epoll_wait?

Yes, napi polling is suspended while the application is busy between
epoll_wait calls. That's where the benefits are coming from.

The consequences depend on the nature of the application and overall
preferences for the system. If there's a "dominant" application for a
number of queues and cores, the resulting latency for other background
applications using the same queues might not be a problem at all.

One other simple mitigation is limiting the number of events that each
epoll_wait call accepts. Note that this batch size also determines the
worst-case latency for the application in question, so there is a
natural incentive to keep it limited.
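
The batch-size limit mentioned above can be sketched with the maxevents
argument of epoll. This is a minimal illustration only, assuming Linux and
Python's select.epoll; the helper names (serve_batches, demo) are
hypothetical, and the loop does not itself enable busy polling or IRQ
suspension. It only shows that each poll call returns at most max_batch
events and records the resulting batch sizes.

```python
import select
import socket


def serve_batches(ep, readers, max_batch):
    """Drain ready sockets, accepting at most max_batch events per poll call."""
    batch_sizes = []
    while True:
        # timeout 0 keeps this demo non-blocking; a real server would block here
        events = ep.poll(0, max_batch)
        if not events:
            break
        for fd, _mask in events:
            readers[fd].recv(4096)  # stand-in for processing one request
        batch_sizes.append(len(events))
    return batch_sizes


def demo(num_sockets=5, max_batch=2):
    """Make num_sockets readable at once and report the observed batch sizes."""
    ep = select.epoll()
    pairs = [socket.socketpair() for _ in range(num_sockets)]
    readers = {}
    try:
        for rd, wr in pairs:
            wr.send(b"x")  # one pending "request" per socket
            ep.register(rd.fileno(), select.EPOLLIN)
            readers[rd.fileno()] = rd
        return serve_batches(ep, readers, max_batch)
    finally:
        ep.close()
        for rd, wr in pairs:
            rd.close()
            wr.close()
```

Recording the batch sizes this way is also one possible method for measuring
the distribution of epoll_wait batch sizes for a given workload.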

A more complex application design, like you suggest, might also be an
option.

Only when switching back and forth between polling and interrupts is it
possible to get low latencies across a large spectrum of offered loads
without burning cpu cycles at 100%.

Ah, I see what you're saying, yes, you're right. In this case ignore my comment
about ep_suspend_napi_irqs/napi_resume_irqs.

Thanks for probing and double-checking everything! Feedback is important
for us to properly document our proposal.

Let's see how other people feel about per-dev irq_suspend_timeout. Properly
disabling napi during busy polling is super useful, but it would still
be nice to plumb irq_suspend_timeout via epoll context or have it set on
a per-napi basis imho.

I agree, this would allow each napi queue to tune itself based on
heuristics. But I think doing it through an epoll-independent interface
makes more sense, as Stan suggested earlier.

The question is whether to add a useful mechanism (one sysfs parameter
and a few lines of code) that is optional, but with demonstrable and
significant performance/efficiency improvements for an important class
of applications - or wait for an uncertain future?

The issue is that this one little change can never be removed, as it
becomes ABI.

Let's get the right API from the start.

Not sure that a global variable, or sysfs as API, is the right one.

Sorry per-device, not global.

My main concern is that it adds yet another user tunable integer, for
which the right value is not obvious.

This is a feature for advanced users just like SO_INCOMING_NAPI_ID
and countless other features.

The value may not be obvious, but guidance (in the form of
documentation) can be provided.

Okay. Could you share a stab at what that would look like?

The timeout needs to be large enough that an application can get a
meaningful number of incoming requests processed without softirq
interference. At the same time, the timeout value determines the
worst-case delivery delay that a concurrent application using the same
queue(s) might experience. Please also see my response to Samiullah
quoted above. The specific circumstances and trade-offs might vary,
that's why a simple constant likely won't do.

Thanks. I really do mean this as an exercise of what documentation in
Documentation/networking/napi.rst will look like. That helps make the
case that the interface is reasonably easy to use (even if only
targeting advanced users).

How does a user measure how much time a process will spend on
processing a meaningful number of incoming requests, for instance?
In practice, probably just a hunch?

As an example, we measure around 1M QPS in our experiments, fully 
utilizing 8 cores and knowing that memcached is quite scalable. Thus we 
can conclude a single request takes about 8 us processing time on 
average. That has led us to a 20 us small timeout (gro_flush_timeout), 
enough to make sure that a single request is likely not interfered with, 
but otherwise as small as possible. If multiple requests arrive, the 
system will quickly switch back to polling mode.

At the other end, we have picked a very large irq_suspend_timeout of 
20,000 us to demonstrate that it does not negatively impact latency. 
This would cover 2,500 requests, which is likely excessive, but was 
chosen for demonstration purposes. One can easily measure the 
distribution of epoll_wait batch sizes and batch sizes as low as 64 are 
already very efficient, even in high-load situations.
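
The arithmetic behind these numbers can be written out as a small sketch.
The figures (1M QPS, 8 cores, 20,000 us) are the ones quoted above; the
function names are hypothetical and chosen only for illustration.

```python
def per_request_us(qps, cores):
    """Average CPU time per request in microseconds, assuming all cores are fully utilized."""
    return cores * 1_000_000 / qps


def requests_covered(timeout_us, request_us):
    """How many back-to-back requests of request_us each fit into timeout_us."""
    return timeout_us / request_us


request_us = per_request_us(qps=1_000_000, cores=8)            # about 8 us per request
covered = requests_covered(timeout_us=20_000, request_us=request_us)  # about 2,500 requests
```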

Also see next paragraph.

Playing devil's advocate some more: given that ethtool usecs have to
be chosen with a similar trade-off between latency and efficiency,
could a multiplicative factor of this (or gro_flush_timeout, same
thing) be sufficient and easier to choose? The documentation does
state that the value chosen must be >= gro_flush_timeout.

I believe this would take away flexibility without gaining much. You'd 
still want some sort of admin-controlled 'enable' flag, so you'd still 
need some kind of parameter.

When using our scheme, the factor between gro_flush_timeout and 
irq_suspend_timeout should *roughly* correspond to the maximum batch 
size that an application would process in one go (orders of magnitude, 
see above). This determines both the target application's worst-case 
latency as well as the worst-case latency of concurrent applications, if 
any, as mentioned previously. I believe the optimal factor will vary 
between different scenarios.
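
As a rough illustration of that relationship, the sketch below computes the
factor for the values used in our experiments and, going the other way,
derives a starting point for irq_suspend_timeout from a chosen batch size.
The function names are hypothetical, and the multiplication is only an
order-of-magnitude reading of the statement above, not a rule taken from
the patches.

```python
def suspend_to_flush_factor(irq_suspend_timeout_us, gro_flush_timeout_us):
    """Ratio between the two timeouts; roughly the maximum batch size covered."""
    if gro_flush_timeout_us <= 0:
        raise ValueError("gro_flush_timeout_us must be positive")
    return irq_suspend_timeout_us / gro_flush_timeout_us


def rough_irq_suspend_timeout_us(gro_flush_timeout_us, max_batch):
    """Order-of-magnitude starting point: one small timeout per request in a batch."""
    return gro_flush_timeout_us * max_batch


factor = suspend_to_flush_factor(20_000, 20)        # values from the experiments above
starting_point = rough_irq_suspend_timeout_us(20, 64)  # for an epoll_wait batch size of 64
```

With the experimental values the factor comes out at 1000, the same order of
magnitude as the 2,500 requests estimated earlier; a batch size of 64 would
suggest a much smaller suspend timeout.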

If the only goal is to safely reenable interrupts when the application
stops calling epoll_wait, does this have to be user tunable?

Can it be either a single good enough constant, or derived from
another tunable, like busypoll_read?

I believe you meant busy_read here, is that right?

At any rate:

    - I don't think a single constant is appropriate, just as it
      wasn't appropriate for the existing mechanism
      (napi_defer_hard_irqs/gro_flush_timeout), and

    - Deriving the value from a pre-existing parameter to preserve the
      ABI, like busy_read, makes using this more confusing for users
      and complicates the API significantly.

I agree we should get the API right from the start; that's why we've
submitted this as an RFC ;)

We are happy to take suggestions from the community, but, IMHO,
re-using an existing parameter for a different purpose only in
certain circumstances (if I understand your suggestions) is a much
worse choice than adding a new tunable that clearly states its
intended singular purpose.

Ack. I was thinking whether an epoll flag through your new epoll
ioctl interface to toggle the IRQ suspension (and timer start)
would be preferable, because it is more fine-grained.

A value provided by an application through the epoll ioctl would not be
subject to admin oversight, so a misbehaving application could set an
arbitrary timeout value. A sysfs value needs to be set by an admin. The
ideal timeout value depends both on the particular target application as
well as concurrent applications using the same queue(s) - as sketched above.

I meant setting the value systemwide (or per-device), but opting in to
the feature via a binary epoll option. Really an epoll_wait flag, if we
had flags.

Any admin privileged operations can also be protected at the epoll
level by requiring CAP_NET_ADMIN too, of course. But fair point that
this might operate in a multi-process environment, so values should
not be hardcoded into the binaries.

Just asking questions to explore the option space so as not to settle
on an API too soon. Given that, as said, we cannot remove it later.

I agree, but I believe we are converging? Also taking into account Joe's 
earlier response, given that the suspend mechanism dovetails so nicely 
with gro_flush_timeout and napi_defer_hard_irqs, it just seems natural 
to put irq_suspend_timeout at the same level and I haven't seen any 
strong reason to put it elsewhere.

Also, the value is likely dependent more on the expected duration
of userspace processing? If so, it would be the same for all
devices, so does a per-netdev value make sense?

It is per-netdev in the current proposal to be at the same granularity
as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension
operates at the same level/granularity. This allows for more control
than a global setting and it can be migrated to per-napi settings along
with gro_flush_timeout and napi_defer_hard_irqs when the time comes.
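
To make the per-netdev granularity concrete, here is a small sketch that
reads the three settings for one device from sysfs. It assumes the existing
/sys/class/net/<dev>/gro_flush_timeout and napi_defer_hard_irqs files and
assumes that irq_suspend_timeout appears next to them under that name, as
proposed in this RFC; on kernels without the patches that file will be
missing and is reported as None. The helper name is hypothetical and the
values are returned raw, without interpreting units.

```python
from pathlib import Path

# irq_suspend_timeout is the name proposed in this RFC; it is an assumption
# that it sits beside the two existing per-netdev files.
TUNABLES = ("gro_flush_timeout", "napi_defer_hard_irqs", "irq_suspend_timeout")


def read_netdev_tunables(dev, sysfs_root="/sys/class/net"):
    """Return {name: int or None} for the per-netdev tunables of one device."""
    result = {}
    for name in TUNABLES:
        path = Path(sysfs_root) / dev / name
        try:
            result[name] = int(path.read_text().strip())
        except (FileNotFoundError, PermissionError):
            result[name] = None  # not present on this kernel, or not readable
    return result
```

For example, read_netdev_tunables("eth0") returns one entry per file, which
makes it easy to check that all three settings are configured consistently
for the same device.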

Ack, makes sense. Many of these design choices and their rationale are
good to explicitly capture in the commit message.

Agreed.

Thanks,
Martin