On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@xxxxxxxxxxxx> wrote:
>
> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>> Greetings:
>
> [snip]
>
> >>>>>>> Maybe expand more on which code paths we are trying to improve? The existing
> >>>>>>> busy polling code is not super readable, so it would be nice to simplify
> >>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
> >>>>>>
> >>>>>> There are essentially three possible loops for network processing:
> >>>>>>
> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> >>>>>>
> >>>>>> 2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
> >>>>>> with the shortcomings described above
> >>>>>>
> >>>>>> 3) epoll -> busy-poll -> napi poll
> >>>>>>
> >>>>>> If a system is configured for 1), not much can be done, as it is difficult
> >>>>>> to interject anything into this loop without adding state and side effects.
> >>>>>> This is what we tried for the paper, but it ended up being a hack.
> >>>>>>
> >>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
> >>>>>> "wrestle" with each other for control. Injecting the larger
> >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> >>>>>> of Loop 3) and creates the nice pattern described above.
> >>>>>
> >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
> >>>>> isn't fast enough to keep up with the timer, presumably? I wonder
> >>>>> if we need to use this opportunity and do a proper API as Joe hints in the
> >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
> >>>>> this queue / napi_id and with this timeout". And then we can essentially make
> >>>>> gro_flush_timeout per queue (and avoid
> >>>>> napi_resume_irqs/napi_suspend_irqs). The existing gro_flush_timeout feels
> >>>>> too hacky already :-(
> >>>>
> >>>> If someone would implement the necessary changes to make these parameters
> >>>> per-napi, this would improve things further, but note that the current
> >>>> proposal gives strong performance across a range of workloads, which is
> >>>> otherwise difficult to impossible to achieve.
> >>>
> >>> Let's see what other people have to say. But we tried to do a similar
> >>> setup at Google recently and getting all these parameters right
> >>> was not trivial. Joe's recent patch series to push some of these into
> >>> epoll context is a step in the right direction. It would be nice to
> >>> have a more explicit interface to express busy polling preferences for
> >>> the users vs chasing a bunch of global tunables and fighting against softirq
> >>> wakeups.
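For reference, the per-epoll knobs from Joe's series that are mentioned above
ride on the epoll fd itself and are set via an ioctl. Below is a minimal
userspace sketch; the values are purely illustrative, and the fallback
definitions reflect my reading of that series' uapi, so double-check them
against your headers:

/* Sketch only: per-epoll busy polling via EPIOCSPARAMS.
 * Values are illustrative; error handling is minimal.
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* Fallback mirroring (to the best of my knowledge) the uapi added for
 * per-epoll busy polling; skipped if your headers already provide it. */
struct epoll_params {
        uint32_t busy_poll_usecs;  /* how long to busy poll per epoll_wait */
        uint16_t busy_poll_budget; /* packets per napi poll while busy polling */
        uint8_t  prefer_busy_poll; /* keep device irqs deferred between polls */
        uint8_t  __pad;            /* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

int main(void)
{
        struct epoll_params params = {
                .busy_poll_usecs  = 64,
                .busy_poll_budget = 64,
                .prefer_busy_poll = 1,
        };
        struct epoll_event events[64];
        int epfd = epoll_create1(0);

        if (epfd < 0 || ioctl(epfd, EPIOCSPARAMS, &params) < 0) {
                perror("epoll busy poll setup");
                return 1;
        }

        /* ... epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev) for the actual sockets ... */

        for (;;) {
                /* busy-polls the associated napi context first, then sleeps;
                 * irq suspension keeps device irqs off for as long as this
                 * loop keeps finding work. */
                int n = epoll_wait(epfd, events, 64, -1);

                if (n <= 0)
                        break;
                /* process n events */
        }
        return 0;
}

Nothing in the sketch is new to this series; it is just the existing
per-epoll interface that the suspend mechanism builds on.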
> >>
> >> One of the goals of this patch set is to reduce parameter tuning and make
> >> the parameter setting independent of workload dynamics, so it should make
> >> things easier. This is of course notwithstanding that per-napi settings
> >> would be even better.
> >>
> >> If you are able to share more details of your previous experiments (here or
> >> off-list), I would be very interested.
> >
> > We went through a similar exercise of trying to get the tail latencies down.
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> > we went with a hard-coded napi_id argument instead of tracking) and trying to
> > get a workable set of budget/timeout/gro_flush. We were fine with burning all
> > the cpu capacity we had and no sleep at all, so we ended up having a bunch
> > of special cases in the epoll loop to avoid the sleep.
> >
> > But we were trying to make a different model work (the one you mention in the
> > paper as well) where the userspace busy-pollers are just running napi_poll
> > on one cpu and the actual work is consumed by the userspace on a different cpu.
> > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> > and another epoll fd with actual sockets for signaling).
> >
> > This mode has a different set of challenges with the socket lock, the socket
> > rx queue and the backlog processing :-(
>
> I agree. That model has challenges and is extremely difficult to tune right.

We noticed a similar issue when we were using the same thread to do
application work and also napi busy polling on it. A large gro_flush_timeout
would improve throughput under load but hurt latency under low load, and
vice versa. We were actually thinking of changing the gro_flush_timeout
value based on whether napi is in busy_poll state, and resetting it to the
default value for softirq-driven mode when busy polling stops. Since we
didn't have cpu constraints, we had dedicated pollers polling napi and
separate threads polling for epoll events and doing application work.

>
> >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>> an individual queue or application to make sure that IRQ suspension is
> >>>> enabled/disabled right away when the state of the system changes from busy
> >>>> to idle and back.
> >>>
> >>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>> we should be able to do better compared to the current NAPI_F_PREFER_BUSY_POLL,
> >>> which is more like "this particular napi_poll call is user busy polling".
> >>
> >> Then either the application needs to be polling all the time (wasting cpu
> >> cycles) or latencies will be determined by the timeout.

But if I understand correctly, this means that if the application thread
that is supposed to do napi busy polling gets busy doing work on the new
data/events in userspace, napi polling will not be done until the
suspend_timeout triggers? Do you dispatch work to separate worker threads
in userspace from the thread that is doing epoll_wait?

> >>
> >> Only when switching back and forth between polling and interrupts is it
> >> possible to get low latencies across a large spectrum of offered loads
> >> without burning cpu cycles at 100%.
> >
> > Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > about ep_suspend_napi_irqs/napi_resume_irqs.
>
> Thanks for probing and double-checking everything! Feedback is important
> for us to properly document our proposal.
>
> > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > disabling napi during busy polling is super useful, but it would still
> > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > a per-napi basis imho.

I agree, this would allow each napi queue to tune itself based on
heuristics. But I think doing it through an epoll-independent interface
makes more sense, as Stan suggested earlier.
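For completeness, the global tunables in play today are per-device sysfs
files; roughly how a NIC ends up in the deferred-irq mode (Loop 2 above)
that irq suspension builds on. This is only a sketch: the interface name
and values are illustrative, and it needs suitable privileges:

/* Sketch only: put a NIC into deferred-irq mode (Loop 2 above) using the
 * existing per-device sysfs knobs. "eth0" and the values are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_knob(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0)
                perror(path);
        if (fd >= 0)
                close(fd);
}

int main(void)
{
        /* number of empty napi polls before device irqs are re-enabled */
        write_knob("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
        /* timer (in ns) that re-schedules napi while irqs stay deferred */
        write_knob("/sys/class/net/eth0/gro_flush_timeout", "200000");
        return 0;
}

As I understand the proposal, the per-dev irq_suspend_timeout discussed in
this thread would be a much larger additional knob layered on top of these,
so the small gro_flush_timeout does not have to be given up.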
>
> Fingers crossed. I hope this patch will be accepted, because it has
> practical performance and efficiency benefits, and that this will
> further increase the motivation to re-design the entire irq
> defer(/suspend) infrastructure for per-napi settings.
>
> Thanks,
> Martin