Re: [PATCH net-next] net/core: add optional threading for backlog processing

Paolo Abeni <pabeni@xxxxxxxxxx> · Tue, 28 Mar 2023 17:13:20 +0200

On Tue, 2023-03-28 at 11:45 +0200, Felix Fietkau wrote:
> On 28.03.23 11:29, Paolo Abeni wrote:
> > On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
> > > On 24.03.23 18:47, Jakub Kicinski wrote:
> > > > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> > > > > I'm primarily testing this on routers with 2 or 4 CPUs and limited 
> > > > > processing power, handling routing/NAT. RPS is typically needed to 
> > > > > properly distribute the load across all available CPUs. When there is 
> > > > > only a small number of flows that are pushing a lot of traffic, a static 
> > > > > RPS assignment often leaves some CPUs idle, whereas others become a 
> > > > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
> > > > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> > > > 
> > > > The NAPI thread becomes a bottleneck with RPS enabled?
> > > 
> > > The devices that I work with often only have a single rx queue. That can
> > > easily become a bottleneck.
> > > 
> > > > > Making backlog processing threaded helps split up the processing work 
> > > > > even more and distribute it onto remaining idle CPUs.
> > > > 
> > > > You'd want to have both threaded NAPI and threaded backlog enabled?
> > > 
> > > Yes
> > > 
> > > > > It can basically be used to make RPS a bit more dynamic and 
> > > > > configurable, because you can assign multiple backlog threads to a set 
> > > > > of CPUs and selectively steer packets from specific devices / rx queues 
> > > > 
> > > > Can you give an example?
> > > > 
> > > > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > > > to make sure that the RPS does not end up landing on the same CPU as
> > > > the other busy queue?
> > > 
> > > In this part I'm thinking about bigger systems where you want to have a
> > > group of CPUs dedicated to dealing with network traffic without
> > > assigning a fixed function (e.g. NAPI processing or RPS target) to each
> > > one, allowing for more dynamic processing.
> > > 
> > > > > to them and allow the scheduler to take care of the rest.
> > > > 
> > > > You trust the scheduler much more than I do, I think :)
> > > 
> > > In my tests it brings down latency (both avg and p99) considerably in
> > > some cases. I posted some numbers here:
> > > https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@xxxxxxxx/
> > 
> > It's still not 110% clear to me why/how this additional thread could
> > reduce latency. What/which threads are competing for the busy CPU[s]? I
> > suspect it could be easier/cleaner to move the other (non-RPS) threads
> > away.
> In the tests that I'm doing, network processing load from routing/NAT is 
> enough to occupy all available CPUs.
> If I dedicate the NAPI thread to one core and use RPS to steer packet 
> processing to the other cores, the core taking care of NAPI has some 
> idle cycles that go to waste, while the other cores are busy.
> If I include the core in the RPS mask, it can take too much away from 
> the NAPI thread.
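
For reference, the two setups described above differ only in the hex CPU
bitmap written to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. A minimal
sketch of computing those masks, assuming a 4-CPU system with the NAPI
thread pinned to CPU 0 (the CPU numbers and the rps_mask() helper are
hypothetical and not taken from the setup under test):

```python
def rps_mask(cpus):
    """Return the hex bitmap string for a set of CPU indices.

    Limited to CPUs 0..31: the kernel expects larger bitmaps to be split
    into comma-separated 32-bit groups, which this sketch does not do.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError(f"CPU index out of range for this sketch: {cpu}")
        mask |= 1 << cpu
    return format(mask, "x")


NAPI_CPU = 0          # assumption: NAPI thread is pinned to CPU 0
ALL_CPUS = range(4)   # assumption: 4-CPU router

# Setup 1: NAPI core excluded from RPS; it may have idle cycles left over.
exclusive = rps_mask(c for c in ALL_CPUS if c != NAPI_CPU)

# Setup 2: NAPI core included in RPS; backlog work can crowd out NAPI.
shared = rps_mask(ALL_CPUS)

print(f"RPS mask without NAPI core: {exclusive}")
print(f"RPS mask with NAPI core:    {shared}")
```

The resulting strings would be written to rps_cpus for the rx queue in
question; the sketch only computes them and does not touch sysfs.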

I feel like I'm missing some relevant points.

If RPS keeps the target CPU fully busy, moving RPS processing into a
separate thread will still not make more CPU time available.

Which NIC driver are you using?

thanks!

Paolo