Martin Karsten wrote:
> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>> Greetings:
>
> [snip]
>
> >>>>>>> Maybe expand more on what code paths we are trying to improve? Existing
> >>>>>>> busy polling code is not super readable, so it would be nice to simplify
> >>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
> >>>>>>
> >>>>>> There are essentially three possible loops for network processing:
> >>>>>>
> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> >>>>>>
> >>>>>> 2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
> >>>>>> with the shortcomings described above
> >>>>>>
> >>>>>> 3) epoll -> busy-poll -> napi poll
> >>>>>>
> >>>>>> If a system is configured for 1), not much can be done, as it is difficult
> >>>>>> to interject anything into this loop without adding state and side effects.
> >>>>>> This is what we tried for the paper, but it ended up being a hack.
> >>>>>>
> >>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
> >>>>>> "wrestle" with each other for control. Injecting the larger
> >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> >>>>>> of Loop 3) and creates the nice pattern described above.
> >>>>>
> >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
> >>>>> isn't fast enough to keep up with the timer, presumably? I wonder
> >>>>> if we need to use this opportunity and do a proper API as Joe hints in the
> >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
> >>>>> this queue / napi_id and with this timeout". And then we can essentially make
> >>>>> gro_flush_timeout per queue (and avoid
> >>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> >>>>> too hacky already :-(
> >>>>
> >>>> If someone would implement the necessary changes to make these parameters
> >>>> per-napi, this would improve things further, but note that the current
> >>>> proposal gives strong performance across a range of workloads, which is
> >>>> otherwise difficult to impossible to achieve.
> >>>
> >>> Let's see what other people have to say. But we tried to do a similar
> >>> setup at Google recently and getting all these parameters right
> >>> was not trivial. Joe's recent patch series to push some of these into
> >>> the epoll context is a step in the right direction. It would be nice to
> >>> have a more explicit interface to express busy polling preference for
> >>> the users vs chasing a bunch of global tunables and fighting against softirq
> >>> wakeups.
> >>
> >> One of the goals of this patch set is to reduce parameter tuning and make
> >> the parameter setting independent of workload dynamics, so it should make
> >> things easier. This is of course notwithstanding that per-napi settings
> >> would be even better.

I don't follow how adding another tunable reduces parameter tuning.

> >>
> >> If you are able to share more details of your previous experiments (here or
> >> off-list), I would be very interested.
> >
> > We went through a similar exercise of trying to get the tail latencies down.
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> > we went with a hard-coded napi_id argument instead of tracking) and trying to
> > get a workable set of budget/timeout/gro_flush. We were fine with burning all
> > cpu capacity we had and no sleep at all, so we ended up having a bunch
> > of special cases in the epoll loop to avoid the sleep.
> >
> > But we were trying to make a different model work (the one you mention in the
> > paper as well) where the userspace busy-pollers are just running napi_poll
> > on one cpu and the actual work is consumed by the userspace on a different cpu.
> > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> > and another epoll fd with actual sockets for signaling).
> >
> > This mode has a different set of challenges with socket lock, socket rx
> > queue and the backlog processing :-(
>
> I agree. That model has challenges and is extremely difficult to tune right.
>
> >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>> an individual queue or application to make sure that IRQ suspension is
> >>>> enabled/disabled right away when the state of the system changes from busy
> >>>> to idle and back.
> >>>
> >>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>> we should be able to do better compared to the current NAPI_F_PREFER_BUSY_POLL
> >>> which is more like "this particular napi_poll call is user busy polling".
> >>
> >> Then either the application needs to be polling all the time (wasting cpu
> >> cycles) or latencies will be determined by the timeout.
> >>
> >> Only when switching back and forth between polling and interrupts is it
> >> possible to get low latencies across a large spectrum of offered loads
> >> without burning cpu cycles at 100%.
> >
> > Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > about ep_suspend_napi_irqs/napi_resume_irqs.
>
> Thanks for probing and double-checking everything! Feedback is important
> for us to properly document our proposal.
>
> > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > disabling napi during busy polling is super useful, but it would still
> > be nice to plumb irq_suspend_timeout via the epoll context or have it set on
> > a per-napi basis imho.
>
> Fingers crossed. I hope this patch will be accepted, because it has
> practical performance and efficiency benefits, and that this will
> further increase the motivation to re-design the entire irq
> defer(/suspend) infrastructure for per-napi settings.

Overall, the idea of keeping interrupts disabled during event processing
is very interesting. Hopefully the interface can be made more intuitive,
or at least documented more clearly. I had to read the kernel patches to
fully (perhaps) grasp it.

Another +1 on the referenced paper. It points out a specific difference
in behavior that is unrelated to the protection domain, rather than
making a straightforward kernel vs user argument.

The paper also had some explanation that may be clearer for a commit
message than the current cover letter:

"user-level network stacks put the application in charge of the entire
network stack processing (cf. Section 2).
Interrupts are disabled and the application coordinates execution by
alternating between processing existing requests and polling the RX
queues for new data"

[This series extends this behavior to kernel busy polling, while falling
back onto interrupt processing to limit CPU overhead.]

"Instead of re-enabling the respective interrupt(s) as soon as
epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
masked until a subsequent epoll_wait() call comes up empty, i.e., no
events of interest are found and the application thread is about to be
blocked."

"A fallback technical approach would use a kernel timeout set on the
return path from epoll_wait(). If necessary, the timeout re-enables
interrupts regardless of the application’s (mis)behaviour."

[Where misbehavior is not calling epoll_wait again]

"The resulting execution model mimics the execution model of typical
user-level network stacks and does not add any requirements compared to
user-level networking. In fact, it is slightly better, because it can
resort to blocking and interrupt delivery, instead of having to
continuously busyloop during idle times."

This last part shows a trade-off preference on your part: you want low
latency, but also low cpu utilization where possible. This also came up
in this thread. Please state that design decision explicitly. There are
plenty of workloads where burning a core is acceptable (especially as
core counts continue increasing), not "slightly worse".

Kernel polling with full busy polling is also already possible, by
choosing a very high napi_defer_hard_irqs and gro_flush_timeout. So
high, in fact, that these tunables need not be tuned carefully. So what
this series adds is not interrupt suppression during event processing
per se, but doing so in a hybrid mode that balances latency and cpu
load.
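
For illustration only, a minimal sketch (mine, not from this series) of
the pieces that already exist today: the per-device defer tunables plus
the per-epoll busy-poll ioctl from Joe's earlier series (Linux >= 6.9).
The interface name, sysfs values and busy-poll numbers below are made-up
examples; the series under discussion would add irq_suspend_timeout on
top so that IRQs stay masked only while epoll_wait() keeps finding work,
instead of relying on the defer values being "large enough".

/*
 * Sketch: existing knobs only. Assumes an interface named eth0 and
 * root for the sysfs writes, e.g.:
 *
 *   echo 5000000 > /sys/class/net/eth0/gro_flush_timeout      (ns)
 *   echo 1000    > /sys/class/net/eth0/napi_defer_hard_irqs
 */
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <linux/types.h>

#ifndef EPIOCSPARAMS
/* mirrors include/uapi/linux/eventpoll.h for older userspace headers */
struct epoll_params {
	__u32 busy_poll_usecs;
	__u16 busy_poll_budget;
	__u8  prefer_busy_poll;
	__u8  __pad;
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

static int enable_busy_poll(int epfd)
{
	struct epoll_params p;

	memset(&p, 0, sizeof(p));
	p.busy_poll_usecs = 200;	/* example value, workload dependent */
	p.busy_poll_budget = 64;
	p.prefer_busy_poll = 1;		/* keep NAPI in busy-poll mode */

	if (ioctl(epfd, EPIOCSPARAMS, &p) < 0) {
		perror("EPIOCSPARAMS");
		return -1;
	}
	return 0;
}

int main(void)
{
	int epfd = epoll_create1(0);

	if (epfd < 0 || enable_busy_poll(epfd))
		return 1;
	/* add sockets with epoll_ctl() and loop on epoll_wait() as usual */
	return 0;
}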