> -----Original Message-----
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> Sent: Friday, January 12, 2024 11:37 AM
> To: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> Cc: Yury Norov <yury.norov@xxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>;
> Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
> <decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
> kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>;
> leon@xxxxxxxxxx; cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx;
> vkuznets@xxxxxxxxxx; tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx;
> netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> linux-rdma@xxxxxxxxxxxxxxx; Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>;
> Paul Rosswurm <paulros@xxxxxxxxxxxxx>
> Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread
> IRQs per CPUs
>
> From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent:
> Wednesday, January 10, 2024 10:13 PM
> >
> > The test topology used to compare the performance of
> > cpumask_local_spread() and the new approach is:
> >
> > Case 1 (new approach)
> > IRQ   Nodes   Cores   CPUs
> > 0     1       0       0-1
> > 1     1       1       2-3
> > 2     1       2       4-5
> > 3     1       3       6-7
> >
> > Case 2 (existing cpumask_local_spread())
> > IRQ   Nodes   Cores   CPUs
> > 0     1       0       0
> > 1     1       0       1
> > 2     1       1       2
> > 3     1       1       3
> >
> > A total of 4 channels were used, set up via ethtool.
> > Case 1 gave 15 percent better ntttcp performance than Case 2.
> > irqbalance was disabled during the test as well.
> >
> > Also, you are right that on a 64-CPU system this approach will spread
> > the IRQs the same way as cpumask_local_spread(), but in the future we
> > will offer MANA nodes with more than 64 CPUs, where this new design
> > will give better performance.
> >
> > I will add these performance benefit details to the commit message of
> > the next version.
>
> Here are my concerns:
>
> 1. The most commonly used VMs these days have 64 or fewer
> vCPUs and won't see any performance benefit.
>
> 2. Larger VMs probably won't see the full 15% benefit because
> all vCPUs in the local NUMA node will be assigned IRQs. For
> example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> vCPUs in NUMA node 0 will be assigned IRQs. The remaining
> 16 IRQs will be spread out over the 48 CPUs in NUMA node 1
> in a way that avoids sharing a core. But overall that means
> 75% of the IRQs will still be sharing a core and presumably
> won't see any perf benefit.
>
> 3. Your experiment was on a relatively small scale: 4 IRQs
> spread across 2 cores vs. across 4 cores. Have you run any
> experiments on VMs with 128 vCPUs (for example) where
> most of the IRQs are not sharing a core? I'm wondering if
> the results with 4 IRQs really scale up to 64 IRQs. A lot can
> be different in a VM with 64 cores and 2 NUMA nodes vs.
> 4 cores in a single node.
>
> 4. The new algorithm prefers assigning to all vCPUs in
> each NUMA hop over assigning to separate cores. Are there
> experiments showing that is the right tradeoff? What
> are the results if assigning to separate cores is preferred?

I remember that in a customer case, putting the IRQs on the same NUMA node
gave better perf. But I agree, this should be re-tested on the MANA NIC.

- Haiyang
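
For readers following the thread, below is a minimal sketch of the two affinity
policies being compared in the Case 1 / Case 2 tables above. It is illustrative
only, not the actual MANA patch: the function names spread_one_cpu_per_irq()
and spread_one_core_per_irq() and the irqs[] array are hypothetical, and the
handling of more IRQs than cores in a NUMA hop (concerns 2 and 4) is not shown.
cpumask_local_spread(), topology_sibling_cpumask(), and
irq_set_affinity_and_hint() are existing kernel APIs.

	#include <linux/cpumask.h>
	#include <linux/gfp.h>
	#include <linux/interrupt.h>
	#include <linux/topology.h>

	/*
	 * Case 2: one CPU per IRQ in node-local CPU-number order.  With the
	 * CPU enumeration shown in the table above, this places two IRQs on
	 * the two sibling threads of each core.
	 */
	static void spread_one_cpu_per_irq(int *irqs, int nr_irqs, int node)
	{
		int i;

		for (i = 0; i < nr_irqs; i++) {
			unsigned int cpu = cpumask_local_spread(i, node);

			irq_set_affinity_and_hint(irqs[i], cpumask_of(cpu));
		}
	}

	/*
	 * Case 1: one physical core (both SMT siblings) per IRQ, starting
	 * with the cores of the local NUMA node.
	 */
	static void spread_one_core_per_irq(int *irqs, int nr_irqs, int node)
	{
		cpumask_var_t avail;
		int i = 0;

		if (!alloc_cpumask_var(&avail, GFP_KERNEL))
			return;

		/* Start from the online CPUs of the local NUMA node. */
		cpumask_and(avail, cpumask_of_node(node), cpu_online_mask);

		while (i < nr_irqs && !cpumask_empty(avail)) {
			unsigned int cpu = cpumask_first(avail);
			const struct cpumask *core = topology_sibling_cpumask(cpu);

			/* Give this IRQ the whole core, then drop it from the pool. */
			irq_set_affinity_and_hint(irqs[i++], core);
			cpumask_andnot(avail, avail, core);
		}

		/* IRQs left over after the local node is exhausted are not shown. */
		free_cpumask_var(avail);
	}

The 4-IRQ, single-node topology in the tables is exactly the case where the
second loop assigns one core per IRQ; on larger systems, where the IRQs
outnumber the cores in a NUMA hop, the behavior described in concerns 2 and 4
comes into play.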