Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Martin Karsten <mkarsten@xxxxxxxxxxxx> · Tue, 13 Aug 2024 09:18:13 -0400

On 2024-08-13 00:07, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 21:54, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 19:03, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 16:19, Stanislav Fomichev wrote:
On 08/12, Joe Damato wrote:
Greetings:

[snip]

Maybe expand more on what code paths we are trying to improve? The existing
busy polling code is not super readable, so it would be nice to simplify
it a bit in the process (if possible) instead of adding one more tunable.

There are essentially three possible loops for network processing:

1) hardirq -> softirq -> napi poll; this is the baseline functionality

2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
with the shortcomings described above

3) epoll -> busy-poll -> napi poll

If a system is configured for 1), not much can be done, as it is difficult
to interject anything into this loop without adding state and side effects.
This is what we tried for the paper, but it ended up being a hack.

If, however, the system is configured for irq deferral, Loops 2) and 3)
"wrestle" with each other for control. Injecting the larger
irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
of Loop 3) and creates the nice pattern described above.
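
To make the "tilt" between Loop 2) and Loop 3) concrete, here is a toy
userspace model in C. It is only a sketch of the idea as described in this
thread, not kernel code: the struct and both function names are hypothetical,
and the only assumption it encodes is that the longer irq-suspend-timeout
applies after a busy poll that found events, while the shorter
gro_flush_timeout applies after an empty one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical settings holder; field names follow the thread's terms. */
struct toy_napi_cfg {
	uint64_t gro_flush_timeout_ns;   /* short deferral timer of Loop 2) */
	uint64_t irq_suspend_timeout_ns; /* larger timer proposed in this RFC */
};

/*
 * Toy decision: which timeout covers the gap until the next poll?
 * - busy poll found events: the application is keeping up, so the larger
 *   irq-suspend-timeout is used and Loop 2) stays out of the way.
 * - busy poll came back empty: fall back to gro_flush_timeout so the
 *   timer/irq path can take over again.
 */
uint64_t toy_next_timeout_ns(const struct toy_napi_cfg *cfg,
			     bool busy_poll_found_events)
{
	return busy_poll_found_events ? cfg->irq_suspend_timeout_ns
				      : cfg->gro_flush_timeout_ns;
}

/*
 * Loop 3) keeps control if the application gets back to polling
 * before the armed timer would fire.
 */
bool toy_app_keeps_control(const struct toy_napi_cfg *cfg,
			   bool busy_poll_found_events,
			   uint64_t app_processing_ns)
{
	return app_processing_ns <
	       toy_next_timeout_ns(cfg, busy_poll_found_events);
}
```

With illustrative values (say 20 us for gro_flush_timeout and 20 ms for
irq-suspend-timeout, both picked only for the example), an application that
needs 100 us per batch keeps control after a successful busy poll but not
after an empty one, which is the switch between polling and interrupts
discussed here.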

And you hit (2) when the epoll goes to sleep and/or when the userspace
isn't fast enough to keep up with the timer, presumably? I wonder
if we need to use this opportunity and do a proper API as Joe hints in the
cover letter. Something over netlink to say "I'm gonna busy-poll on
this queue / napi_id and with this timeout". And then we can essentially make
gro_flush_timeout per queue (and avoid
napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
too hacky already :-(

If someone would implement the necessary changes to make these parameters
per-napi, this would improve things further, but note that the current
proposal gives strong performance across a range of workloads, which is
otherwise difficult to impossible to achieve.

Let's see what other people have to say. But we tried to do a similar
setup at Google recently and getting all these parameters right
was not trivial. Joe's recent patch series to push some of these into
epoll context is a step in the right direction. It would be nice to
have a more explicit interface to express busy polling preference for
the users vs chasing a bunch of global tunables and fighting against softirq
wakeups.

One of the goals of this patch set is to reduce parameter tuning and make
the parameter setting independent of workload dynamics, so it should make
things easier. This is of course notwithstanding that per-napi settings
would be even better.

If you are able to share more details of your previous experiments (here or
off-list), I would be very interested.

We went through a similar exercise of trying to get the tail latencies down.
Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
we went with a hard-coded napi_id argument instead of tracking) and trying to
get a workable set of budget/timeout/gro_flush. We were fine with burning all
cpu capacity we had and no sleep at all, so we ended up having a bunch
of special cases in epoll loop to avoid the sleep.

But we were trying to make a different model work (the one you mention in the
paper as well) where the userspace busy-pollers are just running napi_poll
on one cpu and the actual work is consumed by the userspace on a different cpu.
(we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
and another epoll fd with actual sockets for signaling).

This mode has a different set of challenges with socket lock, socket rx
queue and the backlog processing :-(

I agree. That model has challenges and is extremely difficult to tune right.

Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
an individual queue or application to make sure that IRQ suspension is
enabled/disabled right away when the state of the system changes from busy
to idle and back.

Can we not handle everything in napi_busy_loop? If we can mark some napi
contexts as "explicitly polled by userspace with a larger defer timeout",
we should be able to do better compared to the current NAPI_F_PREFER_BUSY_POLL
which is more like "this particular napi_poll call is user busy polling".

Then either the application needs to be polling all the time (wasting cpu
cycles) or latencies will be determined by the timeout.

Only when switching back and forth between polling and interrupts is it
possible to get low latencies across a large spectrum of offered loads
without burning cpu cycles at 100%.

Ah, I see what you're saying, yes, you're right. In this case ignore my comment
about ep_suspend_napi_irqs/napi_resume_irqs.

Thanks for probing and double-checking everything! Feedback is important 
for us to properly document our proposal.

Let's see how other people feel about per-dev irq_suspend_timeout. Properly
disabling napi during busy polling is super useful, but it would still
be nice to plumb irq_suspend_timeout via epoll context or have it set on
a per-napi basis imho.

Fingers crossed. I hope this patch will be accepted, because it has 
practical performance and efficiency benefits, and that this will 
further increase the motivation to re-design the entire irq 
defer(/suspend) infrastructure for per-napi settings.

Thanks,
Martin