On Tue, Dec 19, 2023 at 10:18:49AM +0000, Souradeep Chakrabarti wrote:
> >-----Original Message-----
> >From: Yury Norov <yury.norov@xxxxxxxxx>
> >Sent: Monday, December 18, 2023 3:02 AM
> >To: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>; KY Srinivasan
> ><kys@xxxxxxxxxxxxx>; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>;
> >wei.liu@xxxxxxxxxx; Dexuan Cui <decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx;
> >edumazet@xxxxxxxxxx; kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li
> ><longli@xxxxxxxxxxxxx>; yury.norov@xxxxxxxxx; leon@xxxxxxxxxx;
> >cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx; vkuznets@xxxxxxxxxx;
> >tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> >linux-kernel@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx
> >Cc: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>; Paul Rosswurm
> ><paulros@xxxxxxxxxxxxx>
> >Subject: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per CPUs
> >
> >Souradeep found that the driver performs faster if IRQs are spread over CPUs
> >with the following heuristics:
> >
> >1. No more than one IRQ per CPU, if possible;
> >2. NUMA locality is the second priority;
> >3. Sibling dislocality is the last priority.
> >
> >Let's consider this topology:
> >
> >	Node            0               1
> >	Core        0       1       2       3
> >	CPU       0   1   2   3   4   5   6   7
> >
> >The most performant IRQ distribution based on the above topology and
> >heuristics may look like this:
> >
> >	IRQ     Node    Core    CPUs
> >	0       0       0       0-1
> >	1       0       1       2-3
> >	2       0       0       0-1
> >	3       0       1       2-3
> >	4       1       2       4-5
> >	5       1       3       6-7
> >	6       1       2       4-5
> >	7       1       3       6-7
> >
> >The irq_setup() routine introduced in this patch leverages the
> >for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups as
> >described above.
> >
> >According to [1], for a NUMA-aware but sibling-ignorant IRQ distribution
> >based on cpumask_local_spread(), the performance test results look like this:
> >
> >./ntttcp -r -m 16
> >NTTTCP for Linux 1.4.0
> >---------------------------------------------------------
> >08:05:20 INFO: 17 threads created
> >08:05:28 INFO: Network activity progressing...
> >08:06:28 INFO: Test run completed.
> >08:06:28 INFO: Test cycle finished.
> >08:06:28 INFO: ##### Totals: #####
> >08:06:28 INFO: test duration    :60.00 seconds
> >08:06:28 INFO: total bytes      :630292053310
> >08:06:28 INFO: throughput       :84.04Gbps
> >08:06:28 INFO: retrans segs     :4
> >08:06:28 INFO: cpu cores        :192
> >08:06:28 INFO: cpu speed        :3799.725MHz
> >08:06:28 INFO: user             :0.05%
> >08:06:28 INFO: system           :1.60%
> >08:06:28 INFO: idle             :96.41%
> >08:06:28 INFO: iowait           :0.00%
> >08:06:28 INFO: softirq          :1.94%
> >08:06:28 INFO: cycles/byte      :2.50
> >08:06:28 INFO: cpu busy (all)   :534.41%
> >
> >For a NUMA- and sibling-aware IRQ distribution, the same test works 15% faster:
> >
> >./ntttcp -r -m 16
> >NTTTCP for Linux 1.4.0
> >---------------------------------------------------------
> >08:08:51 INFO: 17 threads created
> >08:08:56 INFO: Network activity progressing...
> >08:09:56 INFO: Test run completed.
> >08:09:56 INFO: Test cycle finished.
> >08:09:56 INFO: ##### Totals: #####
> >08:09:56 INFO: test duration    :60.00 seconds
> >08:09:56 INFO: total bytes      :741966608384
> >08:09:56 INFO: throughput       :98.93Gbps
> >08:09:56 INFO: retrans segs     :6
> >08:09:56 INFO: cpu cores        :192
> >08:09:56 INFO: cpu speed        :3799.791MHz
> >08:09:56 INFO: user             :0.06%
> >08:09:56 INFO: system           :1.81%
> >08:09:56 INFO: idle             :96.18%
> >08:09:56 INFO: iowait           :0.00%
> >08:09:56 INFO: softirq          :1.95%
> >08:09:56 INFO: cycles/byte      :2.25
> >08:09:56 INFO: cpu busy (all)   :569.22%
> >
> >[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
> >
> >Signed-off-by: Yury Norov <yury.norov@xxxxxxxxx>
> >Co-developed-by: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> >---
> > .../net/ethernet/microsoft/mana/gdma_main.c | 28 +++++++++++++++++++
> > 1 file changed, 28 insertions(+)
> >
> >diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> >index 6367de0c2c2e..11e64e42e3b2 100644
> >--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> >+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> >@@ -1243,6 +1243,34 @@ void mana_gd_free_res_map(struct gdma_resource *r)
> > 	r->size = 0;
> > }
> >
> >+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> >+{
> >+	const struct cpumask *next, *prev = cpu_none_mask;
> >+	cpumask_var_t cpus __free(free_cpumask_var);
> >+	int cpu, weight;
> >+
> >+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> >+		return -ENOMEM;
> >+
> >+	rcu_read_lock();
> >+	for_each_numa_hop_mask(next, node) {
> >+		weight = cpumask_weight_andnot(next, prev);
> >+		while (weight-- > 0) {

> Make it: while (weight > 0) {

> >+			cpumask_andnot(cpus, next, prev);
> >+			for_each_cpu(cpu, cpus) {
> >+				if (len-- == 0)
> >+					goto done;
> >+				irq_set_affinity_and_hint(*irqs++,
> >+							  topology_sibling_cpumask(cpu));
> >+				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));

> Here do --weight, else this code will traverse the same node N^2 times,
> where each node has N CPUs.

Sure. When building your series on top of this, can you please fix it
in place?

Thanks,
Yury

> >+			}
> >+		}
> >+		prev = next;
> >+	}
> >+done:
> >+	rcu_read_unlock();
> >+	return 0;
> >+}
> >+
> > static int mana_gd_setup_irqs(struct pci_dev *pdev)
> > {
> > 	unsigned int max_queues_per_port = num_online_cpus();
> >--
> >2.40.1