On 08/12, Martin Karsten wrote:
> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> > > On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > > On 08/12, Joe Damato wrote:
> > > > > Greetings:
> > > > >
> > > > > Martin Karsten (CC'd) and I have been collaborating on some ideas about
> > > > > ways of reducing tail latency when using epoll-based busy poll and we'd
> > > > > love to get feedback from the list on the code in this series. This is
> > > > > the idea I mentioned at netdev conf, for those who were there. Barring
> > > > > any major issues, we hope to submit this officially shortly after RFC.
> > > > >
> > > > > The basic idea for suspending IRQs in this manner was described in an
> > > > > earlier paper presented at Sigmetrics 2024 [1].
> > > >
> > > > Let me explicitly call out the paper. Very nice analysis!
> > >
> > > Thank you!
> > >
> > > [snip]
> > >
> > > > > Here's how it is intended to work:
> > > > >   - An administrator sets the existing sysfs parameters for
> > > > >     defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> > > > >
> > > > >   - An administrator sets the new sysfs parameter irq_suspend_timeout
> > > > >     to a larger value than gro-timeout to enable IRQ suspension.
> > > >
> > > > Can you expand more on what's the problem with the existing gro_flush_timeout?
> > > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> > > > prefer_busy_poll case (why?)? Because looking at the first two patches,
> > > > you essentially replace all usages of gro_flush_timeout with a new variable
> > > > and I don't see how it helps.
> > >
> > > gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
> > > deferral mechanism and as such, always active when configured.
> > > Its static periodic softirq processing leads to a situation where:
> > >
> > > - A long gro-flush-timeout causes high latencies when load is sufficiently
> > >   below capacity, or
> > >
> > > - a short gro-flush-timeout causes overhead when softirq execution
> > >   asynchronously competes with application processing at high load.
> > >
> > > The shortcomings of this are documented (to some extent) by our experiments.
> > > See defer20 working well at low load, but having problems at high load,
> > > while defer200 has higher latency at low load.
> > >
> > > irq-suspend-timeout is only active when an application uses
> > > prefer-busy-polling and in that case, produces a nice alternating pattern of
> > > application processing and networking processing (similar to what we
> > > describe in the paper). This then works well with both low and high load.
> >
> > So you only want it for the prefer-busy-polling case, makes sense. I was
> > a bit confused by the difference between defer200 and suspend200,
> > but now I see that defer200 does not enable busypoll.
> >
> > I'm assuming that if you enable busypoll in defer200 case, the numbers
> > should be similar to suspend200 (ignoring potentially affecting
> > non-busypolling queues due to higher gro_flush_timeout).
>
> defer200 + napi busy poll is essentially what we labelled "busy" and it does
> not perform as well, since it still suffers interference between application
> and softirq processing.

With all your patches applied? Why? Userspace not keeping up?

> > > > Maybe expand more on what code paths are we trying to improve? Existing
> > > > busy polling code is not super readable, so would be nice to simplify
> > > > it a bit in the process (if possible) instead of adding one more tunable.
> > >
> > > There are essentially three possible loops for network processing:
> > >
> > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> > >
> > > 2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
> > > with the shortcomings described above
> > >
> > > 3) epoll -> busy-poll -> napi poll
> > >
> > > If a system is configured for 1), not much can be done, as it is difficult
> > > to interject anything into this loop without adding state and side effects.
> > > This is what we tried for the paper, but it ended up being a hack.
> > >
> > > If however the system is configured for irq deferral, Loops 2) and 3)
> > > "wrestle" with each other for control. Injecting the larger
> > > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> > > of Loop 3) and creates the nice pattern described above.
> >
> > And you hit (2) when the epoll goes to sleep and/or when the userspace
> > isn't fast enough to keep up with the timer, presumably? I wonder
> > if we need to use this opportunity and do a proper API as Joe hints in the
> > cover letter. Something over netlink to say "I'm gonna busy-poll on
> > this queue / napi_id and with this timeout". And then we can essentially make
> > gro_flush_timeout per queue (and avoid
> > napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> > too hacky already :-(
>
> If someone would implement the necessary changes to make these parameters
> per-napi, this would improve things further, but note that the current
> proposal gives strong performance across a range of workloads, which is
> otherwise difficult to impossible to achieve.

Let's see what other people have to say. But we tried to do a similar
setup at Google recently and getting all these parameters right
was not trivial. Joe's recent patch series to push some of these into
epoll context is a step in the right direction.
It would be nice to have a more explicit interface to express busy polling
preference for the users vs chasing a bunch of global tunables and
fighting against softirq wakeups.

> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> an individual queue or application to make sure that IRQ suspension is
> enabled/disabled right away when the state of the system changes from busy
> to idle and back.

Can we not handle everything in napi_busy_loop? If we can mark some napi
contexts as "explicitly polled by userspace with a larger defer timeout",
we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
which is more like "this particular napi_poll call is user busy polling".

> > > [snip]
> > >
> > > > > - suspendX:
> > > > >   - set defer_hard_irqs to 100
> > > > >   - set gro_flush_timeout to X,000
> > > > >   - set irq_suspend_timeout to 20,000,000
> > > > >   - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
> > > > >     busy_poll_budget = 64, prefer_busy_poll = true)
> > > >
> > > > What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
> > > > to busy_poll sysctl value?
> > >
> > > Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
> > > (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
> > > busy-polling even when the application does not actually need or want it.
> > > Only one iteration through the busy loop is needed to make the new scheme
> > > work. Additional napi busy polling over and above is optional.
> >
> > Ack, thanks, was trying to understand why not stay with
> > busy_poll_usecs=64 for consistency. But I guess you were just
> > trying to show that patch 4/5 works.
>
> Right, and we would potentially be wasting CPU cycles by adding more
> busy-looping.

Or potentially improving the latency more if you happen to get more packets
during busy_poll_usecs duration?
I'd imagine some applications might prefer to 100% busy poll without ever
going to sleep (that would probably require getting rid of napi_id tracking
in epoll, but that's a different story).