On Fri, 24 Mar 2023 18:57:03 +0100 Felix Fietkau wrote:
> >> It can basically be used to make RPS a bit more dynamic and
> >> configurable, because you can assign multiple backlog threads to a set
> >> of CPUs and selectively steer packets from specific devices / rx queues
> >
> > Can you give an example?
> >
> > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > to make sure that the RPS does not end up landing on the same CPU as
> > the other busy queue?
>
> In this part I'm thinking about bigger systems where you want to have a
> group of CPUs dedicated to dealing with network traffic without
> assigning a fixed function (e.g. NAPI processing or RPS target) to each
> one, allowing for more dynamic processing.

I tried the threaded NAPI on larger systems and helped others try,
and so far it's not been beneficial :( Even the load balancing
improvements are not significant enough to use it, and there
is a large risk of scheduler making the wrong decision.

Hence my questioning - I'm trying to understand what you're doing
differently.

> >> to them and allow the scheduler to take care of the rest.
> >
> > You trust the scheduler much more than I do, I think :)
>
> In my tests it brings down latency (both avg and p99) considerably in
> some cases. I posted some numbers here:
> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@xxxxxxxx/

Could you provide the full configuration for this test?
In non-threaded mode the RPS is enabled to spread over remaining
3 cores?
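
To make the last question concrete, here is a rough sketch of what "RPS spread over the remaining 3 cores" could mean on a 4-CPU box. It only computes the hex bitmask that would go into the per-queue rps_cpus file. The device name (eth0), the rx queue index (0), and the assumption that CPU 0 handles the NIC interrupt are all made up for illustration and are not taken from the test being discussed. The helper names are hypothetical too.

```python
# Hypothetical helpers, for illustration only. Nothing here touches sysfs;
# it just builds the values one would write there.

def rps_cpus_mask(cpus):
    """Return the hex bitmask string selecting the given CPU indices.

    This sketch only covers CPUs 0..31, i.e. a single 32-bit mask word.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError(f"CPU index out of range for this sketch: {cpu}")
        mask |= 1 << cpu
    return format(mask, "x")


def rps_cpus_path(dev, rx_queue):
    """Return the sysfs path of the RPS CPU mask for one rx queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"


# Assumption: 4 CPUs, CPU 0 takes the interrupt, RPS targets the other three.
irq_cpu = 0
remaining = [cpu for cpu in range(4) if cpu != irq_cpu]
print(rps_cpus_path("eth0", 0), rps_cpus_mask(remaining))
```

With those assumptions the mask covers CPUs 1-3, so the non-threaded baseline would have that value written to the queue's rps_cpus file. Whether the posted numbers were measured against such a baseline is exactly what the question above is asking.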