On Sat, Jan 13, 2024 at 11:11:50AM -0800, Yury Norov wrote:
> On Sat, Jan 13, 2024 at 04:20:31PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent: Friday, January 12, 2024 10:31 PM
> >
> > > On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
> > > >
> > > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Friday, January 12, 2024 11:37 AM
> > > > >
> > > > > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent:
> > > > > Wednesday, January 10, 2024 10:13 PM
> > > > > >
> > > > > > The test topology used to compare the performance of
> > > > > > cpu_local_spread() and the new approach is:
> > > > > > Case 1
> > > > > > IRQ     Nodes     Cores     CPUs
> > > > > > 0       1         0         0-1
> > > > > > 1       1         1         2-3
> > > > > > 2       1         2         4-5
> > > > > > 3       1         3         6-7
> > > > > >
> > > > > > and with existing cpu_local_spread()
> > > > > > Case 2
> > > > > > IRQ     Nodes     Cores     CPUs
> > > > > > 0       1         0         0
> > > > > > 1       1         0         1
> > > > > > 2       1         1         2
> > > > > > 3       1         1         3
> > > > > >
> > > > > > A total of 4 channels were used, which were set up by ethtool.
> > > > > > Case 1 with ntttcp gave 15 percent better performance than
> > > > > > case 2. During the test irqbalance was disabled as well.
> > > > > >
> > > > > > Also you are right, with a 64-CPU system this approach will spread
> > > > > > the IRQs like cpu_local_spread(), but in the future we will offer
> > > > > > MANA nodes with more than 64 CPUs. There this new design will
> > > > > > give better performance.
> > > > > >
> > > > > > I will add these performance benefit details to the commit message
> > > > > > of the next version.
> > > > >
> > > > > Here are my concerns:
> > > > >
> > > > > 1. The most commonly used VMs these days have 64 or fewer
> > > > > vCPUs and won't see any performance benefit.
> > > > >
> > > > > 2. Larger VMs probably won't see the full 15% benefit because
> > > > > all vCPUs in the local NUMA node will be assigned IRQs. For
> > > > > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > > > > vCPUs in NUMA node 0 will be assigned IRQs. The remaining
> > > > > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > > > > in a way that avoids sharing a core. But overall this means
> > > > > that 75% of the IRQs will still be sharing a core and
> > > > > presumably not see any perf benefit.
> > > > >
> > > > > 3. Your experiment was on a relatively small scale: 4 IRQs
> > > > > spread across 2 cores vs. across 4 cores. Have you run any
> > > > > experiments on VMs with 128 vCPUs (for example) where
> > > > > most of the IRQs are not sharing a core? I'm wondering if
> > > > > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > > > > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > > > > 4 cores in a single node.
> > > > >
> > > > > 4. The new algorithm prefers assigning to all vCPUs in
> > > > > each NUMA hop over assigning to separate cores. Are there
> > > > > experiments showing that is the right tradeoff? What
> > > > > are the results if assigning to separate cores is preferred?
> > > >
> > > > I remember in a customer case, putting the IRQs on the same
> > > > NUMA node had better perf. But I agree, this should be re-tested
> > > > on the MANA NIC.
> > >
> > > 1) and 2) The change will not decrease existing performance, but
> > > systems with a high number of CPUs will benefit from it.
> > >
> > > 3) The results showed around 6 percent improvement.
> > >
> > > 4) The test results showed around a 10 percent difference when IRQs
> > > are spread across multiple NUMA nodes.
> >
> > OK, this looks pretty good. Make clear in the commit messages what
> > the tradeoffs are, and what the real-world benefits are expected to be.
> > Some future developer who wants to understand why IRQs are assigned
> > this way will thank you. :-)
>
> I agree with Michael, this needs to be spoken aloud.
>
> From the above, is it correct that the best performance is achieved
> when the # of IRQs is half the number of CPUs in the 1st node, because
> this configuration allows spreading IRQs across cores in the most
> optimal way? And if we have more or less than that, it hurts
> performance, at least for MANA networking?

It does not decrease performance relative to the current
cpu_local_spread(), but optimum performance comes when the node has
twice as many CPUs as IRQs (assuming SMT==2).

If the number of CPUs is the same as the number of IRQs (that is, the
number of CPUs is <= 64), we see the same performance as the existing
design with cpu_local_spread().

If the node has more than 64 CPUs, we get better performance than
cpu_local_spread().
>
> So, the B|A performance chart may look like this, right?
>
>   irq     nodes     cores     cpus         perf
>   0       1 | 1     0 | 0     0  | 0-1       0%
>   1       1 | 1     0 | 1     1  | 2-3      +5%
>   2       1 | 1     1 | 2     2  | 4-5     +10%
>   3       1 | 1     1 | 3     3  | 6-7     +15%
>   4       1 | 1     0 | 4     3  | 0-1     +12%
>   ...       |         |          |
>   7       1 | 1     1 | 7     3  | 6-7       0%
>   ...
>   15      2 | 2     3 | 3     15 | 14-15     0%
>
> Souradeep, can you please confirm that my understanding is correct?
>
> In v5, can you add a table like the above with real performance
> numbers for your driver? I think that it would help people to
> configure their VMs better when networking is a bottleneck.
>
I will share a chart in the next version of patch 3. Thanks for the
suggestion.
> Thanks,
> Yury
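
To make the core-sharing arithmetic in this thread easier to check, here is a minimal Python sketch of the policy as it is described above: within each NUMA node (nearest first), give every core one IRQ before reusing SMT siblings, and fill the whole node before moving to the next one. This is not the mana driver code. The names spread_irqs and share_fraction are hypothetical, and the mapping "core = cpu // smt" (siblings numbered adjacently, SMT==2) is an assumption made only for illustration.

```python
from collections import Counter


def spread_irqs(num_irqs, node_cpus, smt=2):
    """Return one CPU id per IRQ (hypothetical helper, not driver code).

    node_cpus: list of CPU-id lists, one per NUMA node, nearest node first.
    Assumption: SMT siblings are numbered adjacently, so core = cpu // smt.
    """
    assignment = []
    for cpus in node_cpus:
        # Group this node's CPUs by the core they are assumed to belong to.
        cores = {}
        for cpu in cpus:
            cores.setdefault(cpu // smt, []).append(cpu)
        # Pass k hands out the k-th sibling of every core, so each core
        # gets one IRQ before any core in this node gets a second one.
        for k in range(smt):
            for core in sorted(cores):
                if len(assignment) == num_irqs:
                    return assignment
                if k < len(cores[core]):
                    assignment.append(cores[core][k])
    return assignment


def share_fraction(assignment, smt=2):
    """Fraction of IRQs that sit on a core also used by another IRQ."""
    if not assignment:
        return 0.0
    per_core = Counter(cpu // smt for cpu in assignment)
    shared = sum(n for n in per_core.values() if n > 1)
    return shared / len(assignment)


# Case 1 above: 4 IRQs, one node with 8 CPUs (4 cores).
case1 = spread_irqs(4, [list(range(8))])
print(case1, share_fraction(case1))   # [0, 2, 4, 6] 0.0

# Michael's example: 96 vCPUs, 2 NUMA nodes, 64 IRQs.
big = spread_irqs(64, [list(range(48)), list(range(48, 96))])
print(share_fraction(big))            # 0.75
```

Under these assumptions the sketch reproduces both figures quoted in the thread: no core sharing for the 4-IRQ test, and 48 of 64 IRQs (75%) sharing a core in the 96-vCPU example, with the remaining 16 IRQs landing on separate cores of node 1.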