From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent: Tuesday, January 9, 2024 2:51 AM
>
> From: Yury Norov <yury.norov@xxxxxxxxx>
>
> Souradeep investigated that the driver performs faster if IRQs are
> spread on CPUs with the following heuristics:
>
> 1. No more than one IRQ per CPU, if possible;
> 2. NUMA locality is the second priority;
> 3. Sibling dislocality is the last priority.
>
> Let's consider this topology:
>
> Node            0               1
> Core        0       1       2       3
> CPU       0   1   2   3   4   5   6   7
>
> The most performant IRQ distribution based on the above topology
> and heuristics may look like this:
>
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0-1
> 1       1       1       2-3
> 2       1       0       0-1
> 3       1       1       2-3
> 4       2       2       4-5
> 5       2       3       6-7
> 6       2       2       4-5
> 7       2       3       6-7

I didn't pay attention to the detailed discussion of this issue
over the past 2 to 3 weeks during the holidays in the U.S., but
the above doesn't align with the original problem as I understood it.
I thought the original problem was to avoid putting IRQs on both
hyper-threads in the same core, and that the perf improvements
are based on that configuration.  At least that's what the commit
message for Patch 4/4 in this series says.

The above chart results in 8 IRQs being assigned to the 8 CPUs,
probably with 1 IRQ per CPU.  At least on x86, if the affinity
mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
should balance the IRQ assignments between the CPUs in the mask.
So the original problem is still present because both hyper-threads
in a core are likely to have an IRQ assigned.

Of course, this example has 8 IRQs and 8 CPUs, so assigning an
IRQ to every hyper-thread may be the only choice.  If that's the
case, maybe this just isn't a good example to illustrate the
original problem and solution.  But even with a better example
where the # of IRQs is <= half the # of CPUs in a NUMA node,
I don't think the code below accomplishes the original intent.

Maybe I've missed something along the way in getting to this
version of the patch.
Please feel free to set me straight. :-)

Michael

>
> The irq_setup() routine introduced in this patch leverages the
> for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
> as described above.
>
> According to [1], for NUMA-aware but sibling-ignorant IRQ distribution
> based on cpumask_local_spread() performance test results look like this:
>
> /ntttcp -r -m 16
> NTTTCP for Linux 1.4.0
> ---------------------------------------------------------
> 08:05:20 INFO: 17 threads created
> 08:05:28 INFO: Network activity progressing...
> 08:06:28 INFO: Test run completed.
> 08:06:28 INFO: Test cycle finished.
> 08:06:28 INFO: #####  Totals:  #####
> 08:06:28 INFO: test duration    :60.00 seconds
> 08:06:28 INFO: total bytes      :630292053310
> 08:06:28 INFO:   throughput     :84.04Gbps
> 08:06:28 INFO:   retrans segs   :4
> 08:06:28 INFO: cpu cores        :192
> 08:06:28 INFO:   cpu speed      :3799.725MHz
> 08:06:28 INFO:   user           :0.05%
> 08:06:28 INFO:   system         :1.60%
> 08:06:28 INFO:   idle           :96.41%
> 08:06:28 INFO:   iowait         :0.00%
> 08:06:28 INFO:   softirq        :1.94%
> 08:06:28 INFO:   cycles/byte    :2.50
> 08:06:28 INFO: cpu busy (all)   :534.41%
>
> For NUMA- and sibling-aware IRQ distribution, the same test works
> 15% faster:
>
> /ntttcp -r -m 16
> NTTTCP for Linux 1.4.0
> ---------------------------------------------------------
> 08:08:51 INFO: 17 threads created
> 08:08:56 INFO: Network activity progressing...
> 08:09:56 INFO: Test run completed.
> 08:09:56 INFO: Test cycle finished.
> 08:09:56 INFO: #####  Totals:  #####
> 08:09:56 INFO: test duration    :60.00 seconds
> 08:09:56 INFO: total bytes      :741966608384
> 08:09:56 INFO:   throughput     :98.93Gbps
> 08:09:56 INFO:   retrans segs   :6
> 08:09:56 INFO: cpu cores        :192
> 08:09:56 INFO:   cpu speed      :3799.791MHz
> 08:09:56 INFO:   user           :0.06%
> 08:09:56 INFO:   system         :1.81%
> 08:09:56 INFO:   idle           :96.18%
> 08:09:56 INFO:   iowait         :0.00%
> 08:09:56 INFO:   softirq        :1.95%
> 08:09:56 INFO:   cycles/byte    :2.25
> 08:09:56 INFO: cpu busy (all)   :569.22%
>
> [1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
>
> Signed-off-by: Yury Norov <yury.norov@xxxxxxxxx>
> Co-developed-by: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 29 +++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 6367de0c2c2e..6a967d6be01e 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1243,6 +1243,35 @@ void mana_gd_free_res_map(struct gdma_resource *r)
>  	r->size = 0;
>  }
>
> +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> +{
> +	const struct cpumask *next, *prev = cpu_none_mask;
> +	cpumask_var_t cpus __free(free_cpumask_var);
> +	int cpu, weight;
> +
> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	for_each_numa_hop_mask(next, node) {
> +		weight = cpumask_weight_andnot(next, prev);
> +		while (weight > 0) {
> +			cpumask_andnot(cpus, next, prev);
> +			for_each_cpu(cpu, cpus) {
> +				if (len-- == 0)
> +					goto done;
> +				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> +				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> +				--weight;
> +			}
> +		}
> +		prev = next;
> +	}
> +done:
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
>  static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  {
>  	unsigned int max_queues_per_port = num_online_cpus();
> --
> 2.34.1
>
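
For anyone who wants to trace the loop in irq_setup() by hand, here is a
rough userspace sketch. It is only an illustration under stated
assumptions: cpumasks are modeled as plain bitmasks, the 2-node, 4-core,
8-CPU topology from the commit message is hard-coded, and hop_mask(),
sibling_mask() and simulate_irq_setup() are hypothetical stand-ins for
for_each_numa_hop_mask(), topology_sibling_cpumask() and irq_setup().
None of them are kernel APIs. It records the affinity mask each IRQ
would be given, not the CPU the IRQ core finally picks from that mask.

```c
#include <assert.h>

#define NR_CPUS  8
#define NR_NODES 2
#define NR_HOPS  2

/* Assumed topology: CPUs 2n and 2n+1 are hyper-thread siblings. */
static unsigned int sibling_mask(int cpu)
{
	return 0x3u << (cpu & ~1);
}

/* Assumed hop masks: hop 0 is the local node, hop 1 is every CPU. */
static unsigned int hop_mask(int node, int hop)
{
	return hop == 0 ? 0xFu << (4 * node) : 0xFFu;
}

/* Count the set bits in a mask. */
static int mask_weight(unsigned int mask)
{
	int n = 0;

	for (; mask; mask >>= 1)
		n += mask & 1;
	return n;
}

/*
 * Mirror the control flow of irq_setup() from the patch. out[i] receives
 * the affinity mask that IRQ i would get; out[] needs room for up to
 * NR_CPUS entries. Returns the number of IRQs that were given a mask.
 */
static int simulate_irq_setup(unsigned int *out, unsigned int len, int node)
{
	unsigned int prev = 0, next, cpus;
	int assigned = 0;

	assert(node >= 0 && node < NR_NODES);

	for (int hop = 0; hop < NR_HOPS; hop++) {
		int weight;

		next = hop_mask(node, hop);
		weight = mask_weight(next & ~prev);
		while (weight > 0) {
			cpus = next & ~prev;
			for (int cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpus & (1u << cpu)))
					continue;	/* cleared as a sibling, or not in this hop */
				if (len-- == 0)
					return assigned;
				out[assigned++] = sibling_mask(cpu);
				cpus &= ~sibling_mask(cpu);
				--weight;
			}
		}
		prev = next;
	}
	return assigned;
}
```

Calling simulate_irq_setup(out, 8, 0) fills out[] with the sibling masks
0-1, 2-3, 0-1, 2-3, 4-5, 6-7, 4-5, 6-7, which matches the table in the
commit message. With len = 2 only the masks 0-1 and 2-3 are handed out,
so every mask still spans both hyper-threads of a core.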