On Tue, Jan 09, 2024 at 03:28:59PM -0800, Yury Norov wrote:
> Hi Michael,
>
> So, I'm just a guy who helped to formulate the heuristics in an
> itemized form and implement them using the existing kernel API.
> I have no access to MANA machines, and I ran no performance tests
> myself.
>
> On Tue, Jan 09, 2024 at 07:22:38PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> > Sent: Tuesday, January 9, 2024 2:51 AM
> > >
> > > From: Yury Norov <yury.norov@xxxxxxxxx>
> > >
> > > Souradeep investigated that the driver performs faster if IRQs are
> > > spread on CPUs with the following heuristics:
> > >
> > > 1. No more than one IRQ per CPU, if possible;
> > > 2. NUMA locality is the second priority;
> > > 3. Sibling dislocality is the last priority.
> > >
> > > Let's consider this topology:
> > >
> > > Node            1               2
> > > Core        0       1       2       3
> > > CPU       0   1   2   3   4   5   6   7
> > >
> > > The most performant IRQ distribution based on the above topology
> > > and heuristics may look like this:
> > >
> > > IRQ     Nodes   Cores   CPUs
> > > 0       1       0       0-1
> > > 1       1       1       2-3
> > > 2       1       0       0-1
> > > 3       1       1       2-3
> > > 4       2       2       4-5
> > > 5       2       3       6-7
> > > 6       2       2       4-5
> > > 7       2       3       6-7
> >
> > I didn't pay attention to the detailed discussion of this issue
> > over the past 2 to 3 weeks during the holidays in the U.S., but
> > the above doesn't align with the original problem as I understood
> > it. I thought the original problem was to avoid putting IRQs on
> > both hyper-threads in the same core, and that the perf
> > improvements are based on that configuration. At least that's
> > what the commit message for Patch 4/4 in this series says.
>
> Yes, and the original distribution suggested by Souradeep looks very
> similar:
>
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0
> 1       1       1       2
> 2       1       0       1
> 3       1       1       3
> 4       2       2       4
> 5       2       3       6
> 6       2       2       5
> 7       2       3       7
>
> I just added a bit more flexibility, so that the kernel may pick any
> sibling for the IRQ. As I understand it, both approaches have similar
> performance. Probably my fine-tuning added another half-percent...
>
> Souradeep, can you please share the exact numbers on this?
>
> > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > probably with 1 IRQ per CPU. At least on x86, if the affinity
> > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > should balance the IRQ assignments between the CPUs in the mask.
> > So the original problem is still present because both hyper-threads
> > in a core are likely to have an IRQ assigned.
>
> That's what I think: if the topology makes us put IRQs in the same
> sibling group, the best thing we can do is rely on the existing
> balancing mechanisms in the hope that they will do their job well.
>
> > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > IRQ to every hyper-thread may be the only choice. If that's the
> > case, maybe this just isn't a good example to illustrate the
> > original problem and solution.
>
> Yeah... This example illustrates the order of IRQ distribution. I
> really doubt that if we distribute IRQs like in the above example
> there would be any difference in performance. But I think it's quite
> a good illustration. I could write the title for the table like this:
>
>   The order of IRQ distribution for the best performance
>   based on [...] may look like this.
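For what it's worth, the ordering you describe can be reproduced with
a tiny standalone simulation of the example topology. This is only an
illustration of the three heuristics - the actual patch works on
cpumasks through the kernel API, and the names below are made up:

    /* Toy model of the example topology: 2 nodes, 2 cores per node,
     * 2 SMT siblings per core. Walk the nodes in order (NUMA
     * locality), round-robin over a node's cores (sibling
     * dislocality), and leave a node only once every CPU in it has
     * one IRQ (no more than one IRQ per CPU).
     */
    #include <stdio.h>

    #define NODES           2
    #define CORES_PER_NODE  2
    #define CPUS_PER_CORE   2
    #define NIRQS           8

    int main(void)
    {
            int irq = 0;

            for (int node = 0; node < NODES && irq < NIRQS; node++) {
                    /* Each core can take CPUS_PER_CORE IRQs before
                     * the node is full, so make that many passes over
                     * the node's cores, one IRQ per core per pass.
                     */
                    for (int pass = 0; pass < CPUS_PER_CORE && irq < NIRQS; pass++) {
                            for (int c = 0; c < CORES_PER_NODE && irq < NIRQS; c++) {
                                    int core = node * CORES_PER_NODE + c;
                                    int cpu = core * CPUS_PER_CORE;

                                    printf("IRQ %d -> node %d, core %d, CPUs %d-%d\n",
                                           irq++, node + 1, core,
                                           cpu, cpu + CPUS_PER_CORE - 1);
                            }
                    }
            }
            return 0;
    }

With NIRQS = 8 it prints exactly the table above; with NIRQS = 2 or 3
it reproduces the partially populated cases you walk through below.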
> > But even with a better example where the # of IRQs is <= half
> > the # of CPUs in a NUMA node, I don't think the code below
> > accomplishes the original intent.
> >
> > Maybe I've missed something along the way in getting to this
> > version of the patch. Please feel free to set me straight. :-)
>
> Hmm. So if the number of IRQs is half the # of CPUs in the nodes,
> which is 2 in the example above, the distribution will look like
> this:
>
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0-1
> 1       1       1       2-3
>
> And each IRQ belongs to a different sibling group. This follows the
> rules above.
>
> I think of it like we assign an IRQ to a group of 2 CPUs, so from
> the heuristic #1 perspective, each CPU is assigned 1/2 of an IRQ.
>
> If I add one more IRQ, then according to the heuristics, NUMA
> locality trumps sibling dislocality, so we'd assign the IRQ to the
> same node, on any core. My algorithm assigns it to core #0:
>
> 2       1       0       0-1
>
> This doubles the # of IRQs for CPUs 0 and 1: from 1/2 to 1.
>
> The next IRQ should be assigned to the same node again, and we've
> got the only choice:
>
> 3       1       1       2-3
>
> Starting from the 5th IRQ (#4), node #1 is full - each CPU is
> assigned exactly one IRQ - and heuristic #1 makes us switch to the
> other node; then we do the same thing there:
>
> 4       2       2       4-5
> 5       2       3       6-7
> 6       2       2       4-5
> 7       2       3       6-7
>
> So I think the algorithm is correct... I really hope the above makes
> sense. :) If so, I can add it to the commit message for patch #3.
>
> Nevertheless... Souradeep, in addition to the performance numbers,
> can you share your topology and the actual IRQ distribution that
> gains 15%? I think it should be added to the patch #4 commit
> message.

Sure, I will add my topology to the patch #4 commit message. Thanks
for the suggestion.

> Thanks,
> Yury
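P.S. For collecting the actual distribution, something like the
sketch below can dump the affinity of each IRQ in a range (the MANA
IRQ numbers themselves have to be looked up in /proc/interrupts;
error handling is mostly omitted):

    /* Print /proc/irq/<n>/smp_affinity_list for IRQs first..last. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <first-irq> <last-irq>\n",
                            argv[0]);
                    return 1;
            }

            for (int irq = atoi(argv[1]); irq <= atoi(argv[2]); irq++) {
                    char path[64], cpus[256];
                    FILE *f;

                    snprintf(path, sizeof(path),
                             "/proc/irq/%d/smp_affinity_list", irq);
                    f = fopen(path, "r");
                    if (!f)
                            continue;       /* IRQ not allocated */
                    if (fgets(cpus, sizeof(cpus), f))
                            printf("IRQ %d: CPUs %s", irq, cpus);
                    fclose(f);
            }
            return 0;
    }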