On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
>
>
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > Sent: Friday, January 12, 2024 11:37 AM
> > To: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> > Cc: Yury Norov <yury.norov@xxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>;
> > Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
> > <decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
> > kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>;
> > leon@xxxxxxxxxx; cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx;
> > vkuznets@xxxxxxxxxx; tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx;
> > netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-rdma@xxxxxxxxxxxxxxx; Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>;
> > Paul Rosswurm <paulros@xxxxxxxxxxxxx>
> > Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread
> > IRQs per CPUs
> >
> > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent:
> > Wednesday, January 10, 2024 10:13 PM
> > >
> > > The test topology was used to check the performance between
> > > cpu_local_spread() and the new approach is :
> > > Case 1
> > > IRQ     Nodes   Cores   CPUs
> > > 0       1       0       0-1
> > > 1       1       1       2-3
> > > 2       1       2       4-5
> > > 3       1       3       6-7
> > >
> > > and with existing cpu_local_spread()
> > > Case 2
> > > IRQ     Nodes   Cores   CPUs
> > > 0       1       0       0
> > > 1       1       0       1
> > > 2       1       1       2
> > > 3       1       1       3
> > >
> > > Total 4 channels were used, which was set up by ethtool.
> > > case 1 with ntttcp has given 15 percent better performance, than
> > > case 2. During the test irqbalance was disabled as well.
> > >
> > > Also you are right, with 64CPU system this approach will spread
> > > the irqs like the cpu_local_spread() but in the future we will offer
> > > MANA nodes, with more than 64 CPUs. There, this new design will
> > > give better performance.
> > >
> > > I will add this performance benefit details in commit message of
> > > next version.
> >
> > Here are my concerns:
> >
> > 1. The most commonly used VMs these days have 64 or fewer
> > vCPUs and won't see any performance benefit.
> >
> > 2. Larger VMs probably won't see the full 15% benefit because
> > all vCPUs in the local NUMA node will be assigned IRQs. For
> > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > in a way that avoids sharing a core. But overall that means
> > that 75% of the IRQs will still be sharing a core and
> > presumably not see any perf benefit.
> >
> > 3. Your experiment was on a relatively small scale: 4 IRQs
> > spread across 2 cores vs. across 4 cores. Have you run any
> > experiments on VMs with 128 vCPUs (for example) where
> > most of the IRQs are not sharing a core? I'm wondering if
> > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > 4 cores in a single node.
> >
> > 4. The new algorithm prefers assigning to all vCPUs in
> > each NUMA hop over assigning to separate cores. Are there
> > experiments showing that is the right tradeoff? What
> > are the results if assigning to separate cores is preferred?
>
> I remember in a customer case, putting the IRQs on the same
> NUMA node has better perf. But I agree, this should be re-tested
> on MANA nic.

1) and 2) The change will not decrease the existing performance, and
systems with a high number of CPUs will benefit from it.
3) The result has shown around 6 percent improvement.
4) The test result has shown around 10 percent difference when IRQs are
spread across multiple NUMA nodes.
>
> - Haiyang
>
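
To make the arithmetic in concern 2 and the two test topologies easier to
check, here is a rough Python model of the two placement orders discussed
above. It is only a sketch under stated assumptions: the helper names
(make_topology, sequential_spread, core_first_spread, shared_fraction) are
hypothetical, every core is assumed to have 2 SMT siblings numbered
consecutively, and each IRQ is pinned to a single (node, core) slot. It is
not the MANA driver code and not the kernel's spreading helper.

```python
from collections import Counter


def make_topology(num_nodes, cores_per_node, threads_per_core=2):
    """Build nodes -> cores -> CPU ids, numbering CPUs consecutively.

    Assumption: SMT siblings are adjacent (CPU 0-1 on core 0, 2-3 on core 1).
    """
    topo, cpu = [], 0
    for _ in range(num_nodes):
        node = []
        for _ in range(cores_per_node):
            node.append(list(range(cpu, cpu + threads_per_core)))
            cpu += threads_per_core
        topo.append(node)
    return topo


def sequential_spread(topo, num_irqs):
    """One IRQ per CPU in plain CPU order, so siblings fill up first.

    Models the "Case 2" table. Returns one (node, core) pair per IRQ.
    """
    slots = [
        (n, c)
        for n, node in enumerate(topo)
        for c, core in enumerate(node)
        for _ in core
    ]
    return slots[:num_irqs]


def core_first_spread(topo, num_irqs):
    """Within a node, use one CPU per core before reusing siblings.

    The next node is used only once every CPU in the current node has an
    IRQ, which is the behaviour described in concerns 2 and 4.
    """
    slots = []
    for n, node in enumerate(topo):
        max_threads = max(len(core) for core in node)
        for t in range(max_threads):
            for c, core in enumerate(node):
                if t < len(core):
                    slots.append((n, c))
    return slots[:num_irqs]


def shared_fraction(placement):
    """Fraction of IRQs that sit on a core hosting more than one IRQ."""
    per_core = Counter(placement)
    shared = sum(1 for slot in placement if per_core[slot] > 1)
    return shared / len(placement)


# 4 IRQs on 1 node with 4 cores (the two test cases above).
small = make_topology(num_nodes=1, cores_per_node=4)
print(sequential_spread(small, 4))   # [(0, 0), (0, 0), (0, 1), (0, 1)]
print(core_first_spread(small, 4))   # [(0, 0), (0, 1), (0, 2), (0, 3)]

# 64 IRQs on 96 vCPUs: 2 nodes x 24 cores x 2 threads (concern 2).
big = make_topology(num_nodes=2, cores_per_node=24)
print(shared_fraction(core_first_spread(big, 64)))  # 0.75
```

In this model, 48 of the 64 IRQs land on node 0 with two IRQs per core and
the remaining 16 each get their own core on node 1, which reproduces the 75%
figure. Whether that fraction matters for throughput is exactly the open
question in concerns 3 and 4 and cannot be answered by the model.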