On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@xxxxxxxxxxxx> wrote:
>
> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>> Greetings:
>
> [snip]
>
> >>>>>>> Maybe expand more on which code paths we are trying to improve? The existing
> >>>>>>> busy polling code is not super readable, so it would be nice to simplify
> >>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
> >>>>>>
> >>>>>> There are essentially three possible loops for network processing:
> >>>>>>
> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> >>>>>>
> >>>>>> 2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
> >>>>>> with the shortcomings described above
> >>>>>>
> >>>>>> 3) epoll -> busy-poll -> napi poll
> >>>>>>
> >>>>>> If a system is configured for 1), not much can be done, as it is difficult
> >>>>>> to interject anything into this loop without adding state and side effects.
> >>>>>> This is what we tried for the paper, but it ended up being a hack.
> >>>>>>
> >>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
> >>>>>> "wrestle" with each other for control. Injecting the larger
> >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> >>>>>> of Loop 3) and creates the nice pattern described above.
> >>>>>
> >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
> >>>>> isn't fast enough to keep up with the timer, presumably? I wonder
> >>>>> if we need to use this opportunity and do a proper API as Joe hints in the
> >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
> >>>>> this queue / napi_id and with this timeout". And then we can essentially make
> >>>>> gro_flush_timeout per queue (and avoid
> >>>>> napi_resume_irqs/napi_suspend_irqs). The existing gro_flush_timeout feels
> >>>>> too hacky already :-(
> >>>>
> >>>> If someone would implement the necessary changes to make these parameters
> >>>> per-napi, this would improve things further, but note that the current
> >>>> proposal gives strong performance across a range of workloads, which is
> >>>> otherwise difficult to impossible to achieve.
> >>>
> >>> Let's see what other people have to say. But we tried to do a similar
> >>> setup at Google recently and getting all these parameters right
> >>> was not trivial. Joe's recent patch series to push some of these into
> >>> epoll context is a step in the right direction. It would be nice to
> >>> have a more explicit interface to express busy polling preferences for
> >>> the users vs chasing a bunch of global tunables and fighting against softirq
> >>> wakeups.
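For reference, the per-epoll knobs from Joe's series that are mentioned above
ride on the epoll fd itself and are set via an ioctl. Below is a minimal
userspace sketch; the values are purely illustrative, and the fallback
definitions reflect my reading of that series' uapi, so double-check them
against your headers:

/* Sketch only: per-epoll busy polling via EPIOCSPARAMS.
 * Values are illustrative; error handling is minimal.
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* Fallback mirroring (to the best of my knowledge) the uapi added for
 * per-epoll busy polling; skipped if your headers already provide it. */
struct epoll_params {
        uint32_t busy_poll_usecs;  /* how long to busy poll per epoll_wait */
        uint16_t busy_poll_budget; /* packets per napi poll while busy polling */
        uint8_t  prefer_busy_poll; /* keep device irqs deferred between polls */
        uint8_t  __pad;            /* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

int main(void)
{
        struct epoll_params params = {
                .busy_poll_usecs  = 64,
                .busy_poll_budget = 64,
                .prefer_busy_poll = 1,
        };
        struct epoll_event events[64];
        int epfd = epoll_create1(0);

        if (epfd < 0 || ioctl(epfd, EPIOCSPARAMS, &params) < 0) {
                perror("epoll busy poll setup");
                return 1;
        }

        /* ... epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev) for the actual sockets ... */

        for (;;) {
                /* busy-polls the associated napi context first, then sleeps;
                 * irq suspension keeps device irqs off for as long as this
                 * loop keeps finding work. */
                int n = epoll_wait(epfd, events, 64, -1);

                if (n <= 0)
                        break;
                /* process n events */
        }
        return 0;
}

Nothing in the sketch is new to this series; it is just the existing
per-epoll interface that the suspend mechanism builds on.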
> >>
> >> One of the goals of this patch set is to reduce parameter tuning and make
> >> the parameter setting independent of workload dynamics, so it should make
> >> things easier. This is of course notwithstanding that per-napi settings
> >> would be even better.
> >>
> >> If you are able to share more details of your previous experiments (here or
> >> off-list), I would be very interested.
> >
> > We went through a similar exercise of trying to get the tail latencies down.
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> > we went with a hard-coded napi_id argument instead of tracking) and trying to
> > get a workable set of budget/timeout/gro_flush. We were fine with burning all
> > the cpu capacity we had and no sleep at all, so we ended up having a bunch
> > of special cases in the epoll loop to avoid the sleep.
> >
> > But we were trying to make a different model work (the one you mention in the
> > paper as well) where the userspace busy-pollers are just running napi_poll
> > on one cpu and the actual work is consumed by the userspace on a different cpu.
> > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> > and another epoll fd with actual sockets for signaling).
> >
> > This mode has a different set of challenges with the socket lock, the socket
> > rx queue and the backlog processing :-(
>
> I agree. That model has challenges and is extremely difficult to tune right.

We noticed a similar issue when we were using the same thread to do
application work and also napi busy polling on it. A large gro_flush_timeout
would improve throughput under load but hurt latency under low load, and
vice versa. We were actually thinking of changing the gro_flush_timeout
value based on whether napi is in busy_poll state, and resetting it to the
default value for softirq-driven mode when busy polling stops. Since we
didn't have cpu constraints, we had dedicated pollers polling napi and
separate threads polling for epoll events and doing application work.

>
> >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>> an individual queue or application to make sure that IRQ suspension is
> >>>> enabled/disabled right away when the state of the system changes from busy
> >>>> to idle and back.
> >>>
> >>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>> we should be able to do better compared to the current NAPI_F_PREFER_BUSY_POLL,
> >>> which is more like "this particular napi_poll call is user busy polling".
> >>
> >> Then either the application needs to be polling all the time (wasting cpu
> >> cycles) or latencies will be determined by the timeout.

But if I understand correctly, this means that if the application thread
that is supposed to do napi busy polling gets busy doing work on the new
data/events in userspace, napi polling will not be done until the
suspend_timeout triggers? Do you dispatch work to separate worker threads
in userspace from the thread that is doing epoll_wait?

> >>
> >> Only when switching back and forth between polling and interrupts is it
> >> possible to get low latencies across a large spectrum of offered loads
> >> without burning cpu cycles at 100%.
> >
> > Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > about ep_suspend_napi_irqs/napi_resume_irqs.
>
> Thanks for probing and double-checking everything! Feedback is important
> for us to properly document our proposal.
>
> > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > disabling napi during busy polling is super useful, but it would still
> > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > a per-napi basis imho.

I agree, this would allow each napi queue to tune itself based on
heuristics. But I think doing it through an epoll-independent interface
makes more sense, as Stan suggested earlier.
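For completeness, the global tunables in play today are per-device sysfs
files; roughly how a NIC ends up in the deferred-irq mode (Loop 2 above)
that irq suspension builds on. This is only a sketch: the interface name
and values are illustrative, and it needs suitable privileges:

/* Sketch only: put a NIC into deferred-irq mode (Loop 2 above) using the
 * existing per-device sysfs knobs. "eth0" and the values are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_knob(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0)
                perror(path);
        if (fd >= 0)
                close(fd);
}

int main(void)
{
        /* number of empty napi polls before device irqs are re-enabled */
        write_knob("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
        /* timer (in ns) that re-schedules napi while irqs stay deferred */
        write_knob("/sys/class/net/eth0/gro_flush_timeout", "200000");
        return 0;
}

As I understand the proposal, the per-dev irq_suspend_timeout discussed in
this thread would be a much larger additional knob layered on top of these,
so the small gro_flush_timeout does not have to be given up.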
>
> Fingers crossed. I hope this patch will be accepted, because it has
> practical performance and efficiency benefits, and that this will
> further increase the motivation to re-design the entire irq
> defer(/suspend) infrastructure for per-napi settings.
>
> Thanks,
> Martin