From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
Sent: Friday, January 12, 2024 10:31 PM
>
> On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > Sent: Friday, January 12, 2024 11:37 AM
> > >
> > > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> > > Sent: Wednesday, January 10, 2024 10:13 PM
> > > >
> > > > The test topology used to check the performance of
> > > > cpu_local_spread() against the new approach is:
> > > >
> > > > Case 1
> > > > IRQ   Nodes   Cores   CPUs
> > > > 0     1       0       0-1
> > > > 1     1       1       2-3
> > > > 2     1       2       4-5
> > > > 3     1       3       6-7
> > > >
> > > > and with the existing cpu_local_spread():
> > > >
> > > > Case 2
> > > > IRQ   Nodes   Cores   CPUs
> > > > 0     1       0       0
> > > > 1     1       0       1
> > > > 2     1       1       2
> > > > 3     1       1       3
> > > >
> > > > In total 4 channels were used, set up via ethtool. Case 1 with
> > > > ntttcp gave 15 percent better performance than case 2. During the
> > > > test, irqbalance was disabled as well.
> > > >
> > > > Also, you are right: on a system with 64 CPUs this approach will
> > > > spread the IRQs like cpu_local_spread(), but in the future we will
> > > > offer MANA nodes with more than 64 CPUs, where this new design
> > > > will give better performance.
> > > >
> > > > I will add these performance benefit details to the commit message
> > > > of the next version.
> > >
> > > Here are my concerns:
> > >
> > > 1. The most commonly used VMs these days have 64 or fewer
> > > vCPUs and won't see any performance benefit.
> > >
> > > 2. Larger VMs probably won't see the full 15% benefit because
> > > all vCPUs in the local NUMA node will be assigned IRQs. For
> > > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > > vCPUs in NUMA node 0 will be assigned IRQs. The remaining
> > > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > > in a way that avoids sharing a core. But overall that means
> > > that 75% of the IRQs will still be sharing a core and
> > > presumably won't see any perf benefit.
> > >
> > > 3. Your experiment was on a relatively small scale: 4 IRQs
> > > spread across 2 cores vs. across 4 cores. Have you run any
> > > experiments on VMs with 128 vCPUs (for example) where
> > > most of the IRQs are not sharing a core? I'm wondering if
> > > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > > 4 cores in a single node.
> > >
> > > 4. The new algorithm prefers assigning to all vCPUs in
> > > each NUMA hop over assigning to separate cores. Are there
> > > experiments showing that is the right tradeoff? What
> > > are the results if assigning to separate cores is preferred?
> >
> > I remember that in a customer case, putting the IRQs on the same
> > NUMA node gave better perf. But I agree, this should be re-tested
> > on the MANA NIC.
>
> 1) and 2) The change will not decrease the existing performance, but
> systems with a high number of CPUs will benefit from it.
>
> 3) The result has shown around 6 percent improvement.
>
> 4) The test result has shown around a 10 percent difference when IRQs
> are spread across multiple NUMA nodes.

OK, this looks pretty good. Make clear in the commit messages what the
tradeoffs are, and what the real-world benefits are expected to be. Some
future developer who wants to understand why IRQs are assigned this way
will thank you. :-)

Michael
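
The approach debated above walks NUMA hops outward from the device's home
node and, within each hop, gives one IRQ to each physical core before
doubling up on sibling hyper-threads, and uses every CPU in a hop before
spilling to the next. The sketch below is a minimal illustration of that
idea, not the actual MANA patch; it assumes the kernel helpers
for_each_numa_hop_mask(), topology_sibling_cpumask(), and
irq_set_affinity_and_hint(), and the irq_setup() name and signature are
hypothetical.

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/interrupt.h>
#include <linux/rcupdate.h>
#include <linux/topology.h>

/*
 * Hypothetical sketch of hop-by-hop IRQ spreading: within each NUMA
 * hop, every physical core gets one IRQ before any core gets a second.
 */
static int irq_setup(unsigned int *irqs, unsigned int nvec, int node)
{
	const struct cpumask *next, *prev = cpu_none_mask;
	cpumask_var_t cpus;
	int cpu, weight;

	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
		return -ENOMEM;

	rcu_read_lock();
	for_each_numa_hop_mask(next, node) {
		/* CPUs that become reachable at this hop distance */
		cpumask_andnot(cpus, next, prev);
		weight = cpumask_weight(cpus);
		while (weight > 0) {
			for_each_cpu(cpu, cpus) {
				if (nvec-- == 0)
					goto done;
				/*
				 * Hint the whole core and drop its sibling
				 * threads from this pass, so every core gets
				 * one IRQ before any core gets two.
				 */
				irq_set_affinity_and_hint(*irqs++,
						topology_sibling_cpumask(cpu));
				cpumask_andnot(cpus, cpus,
					       topology_sibling_cpumask(cpu));
				--weight;
			}
			/* Every core used once; refill for another pass */
			cpumask_andnot(cpus, next, prev);
		}
		prev = next;
	}
done:
	rcu_read_unlock();
	free_cpumask_var(cpus);
	return 0;
}

On the 4-core/8-CPU topology above this yields Case 1 (IRQ 0 -> CPUs 0-1,
IRQ 1 -> CPUs 2-3, ...), whereas the existing cpumask_local_spread(i, node)
helper (called cpu_local_spread() in the thread), which returns the i-th
CPU in NUMA-proximity order, yields Case 2 (IRQ 0 -> CPU 0, IRQ 1 -> CPU 1,
...), packing consecutive IRQs onto sibling threads of the same core.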