Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Joe Damato <jdamato@xxxxxxxxxx> · Wed, 14 Aug 2024 15:19:44 +0100

On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote:
> Martin Karsten wrote:
> > On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > On 08/12, Martin Karsten wrote:
> > >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > >>> On 08/12, Martin Karsten wrote:
> > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > >>>>> On 08/12, Martin Karsten wrote:
> > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > >>>>>>> On 08/12, Joe Damato wrote:

[...]

> > >>
> > >> One of the goals of this patch set is to reduce parameter tuning and make
> > >> the parameter setting independent of workload dynamics, so it should make
> > >> things easier. This is of course notwithstanding that per-napi settings
> > >> would be even better.
> 
> I don't follow how adding another tunable reduces parameter tuning.

Thanks for taking a look and providing valuable feedback, Willem.

An early draft of the cover letter included some paragraphs which were removed
for the sake of brevity that we can add back in, which addresses your comment
above:

 The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout)
 is useful, but picking the correct values for these settings is difficult and
 the ideal values change as the type of traffic and workload changes.

 For example: picking a large timeout value is good for batching packet
 processing, but induces latency. Picking a small timeout value will interrupt
 user app processing and reduce efficiency. The value chosen would be different
 depending on whether the system is under high load (large timeout) or when less
 busy (small timeout).

As such, adding the new tunable makes it much easier to use the existing ones
and also produces better performance as shown in the results we presented. 

Please let me know if you have any questions; I know that the change we are
introducing is very subtle and I am happy to expand the cover letter if it'd be
helpful for you.

My concern was that the cover letter was too long already, but a big takeaway
for me thus far has been that we should expand the cover letter.

[...]

> > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > disabling napi during busy polling is super useful, but it would still
> > > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > a per-napi basis imho.
> > 
> > Fingers crossed. I hope this patch will be accepted, because it has 
> > practical performance and efficiency benefits, and that this will 
> > further increase the motivation to re-design the entire irq 
> > defer(/suspend) infrastructure for per-napi settings.
> 
> Overall, the idea of keeping interrupts disabled during event
> processing is very interesting.

Thanks; I'm happy to hear we are aligned on this.

> Hopefully the interface can be made more intuitive. Or documented more
> easily. I had to read the kernel patches to fully (perhaps) grasp it.
> 
> Another +1 on the referenced paper. Pointing out a specific difference
> in behavior that is unrelated to the protection domain, rather than a
> straightforward kernel vs user argument. The paper also had some
> explanation that may be clearer for a commit message than the current
> cover letter:
> 
> "user-level network stacks put the application in charge of the entire
> network stack processing (cf. Section 2). Interrupts are disabled and
> the application coordinates execution by alternating between
> processing existing requests and polling the RX queues for new data"
> " [This series extends this behavior to kernel busy polling, while
> falling back onto interrupt processing to limit CPU overhead.]
> 
> "Instead of re-enabling the respective interrupt(s) as soon as
> epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
> masked until a subsequent epoll_wait() call comes up empty, i.e., no
> events of interest are found and the application thread is about to be
> blocked."
> 
> "A fallback technical approach would use a kernel timeout set on the
> return path from epoll_wait(). If necessary, the timeout re-enables
> interrupts regardless of the application´s (mis)behaviour."
> [Where misbehavior is not calling epoll_wait again]
> 
> "The resulting execution model mimics the execution model of typical
> user-level network stacks and does not add any requirements compared
> to user-level networking. In fact, it is slightly better, because it
> can resort to blocking and interrupt delivery, instead of having to
> continuously busyloop during idle times."
>
> This last part shows a preference on your part to a trade-off:
> you want low latency, but also low cpu utilization if possible.
> This also came up in this thread. Please state that design decision
> explicitly.

Sure, we can include that in the list of cover letter updates we
need to make. I could have called it out more clearly, but in the cover
letter [1] I mentioned that latency improved for compared CPU usage (i.e.
CPU efficiency improved):

  The overall takeaway from the results below is that the new mechanism
  (suspend20, see below) results in reduced 99th percentile latency and
  increased QPS in the MAX QPS case (compared to the other cases), and
  reduced latency in the lower QPS cases for comparable CPU usage to the
  base case (and less CPU than the busy case).

> There are plenty of workloads where burning a core is acceptable
> (especially as core count continues increasing), not "slightly worse".

Respectfully, I don't think I'm on board with this argument. "Burning a core"
has side effects even as core counts increase (power, cooling, etc) and it
seems the counter argument is equally valid, as well: there are plenty of
workloads where burning a core is undesirable.

Using less CPU to get comparable performance is strictly better, even if a
system can theoretically support the increased CPU/power/cooling load.

Either way: this is not an either/or. Adding support for the code we've
proposed will be very beneficial for an important set of workloads without
taking anything anyway.

- Joe

[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@xxxxxxxxxx/