On Tue, Jan 09, 2024 at 08:20:31PM +0000, Haiyang Zhang wrote:
>
>
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > Sent: Tuesday, January 9, 2024 2:23 PM
> > To: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>; KY Srinivasan
> > <kys@xxxxxxxxxxxxx>; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>;
> > wei.liu@xxxxxxxxxx; Dexuan Cui <decui@xxxxxxxxxxxxx>;
> > davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx; kuba@xxxxxxxxxx;
> > pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>; yury.norov@xxxxxxxxx;
> > leon@xxxxxxxxxx; cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx;
> > vkuznets@xxxxxxxxxx; tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx;
> > netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-rdma@xxxxxxxxxxxxxxx
> > Cc: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>; Paul Rosswurm
> > <paulros@xxxxxxxxxxxxx>
> > Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per
> > CPUs
> >
> > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent:
> > Tuesday, January 9, 2024 2:51 AM
> > >
> > > From: Yury Norov <yury.norov@xxxxxxxxx>
> > >
> > > Souradeep investigated that the driver performs faster if IRQs are
> > > spread on CPUs with the following heuristics:
> > >
> > > 1. No more than one IRQ per CPU, if possible;
> > > 2. NUMA locality is the second priority;
> > > 3. Sibling dislocality is the last priority.
> > >
> > > Let's consider this topology:
> > >
> > > Node            0               1
> > > Core        0       1       2       3
> > > CPU       0   1   2   3   4   5   6   7
> > >
> > > The most performant IRQ distribution based on the above topology
> > > and heuristics may look like this:
> > >
> > > IRQ     Nodes   Cores   CPUs
> > > 0       1       0       0-1
> > > 1       1       1       2-3
> > > 2       1       0       0-1
> > > 3       1       1       2-3
> > > 4       2       2       4-5
> > > 5       2       3       6-7
> > > 6       2       2       4-5
> > > 7       2       3       6-7
> >
> > I didn't pay attention to the detailed discussion of this issue
> > over the past 2 to 3 weeks during the holidays in the U.S., but
> > the above doesn't align with the original problem as I understood
> > it. I thought the original problem was to avoid putting IRQs on
> > both hyper-threads in the same core, and that the perf
> > improvements are based on that configuration. At least that's
> > what the commit message for Patch 4/4 in this series says.
> >
> > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > probably with 1 IRQ per CPU. At least on x86, if the affinity
> > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > should balance the IRQ assignments between the CPUs in the mask.
> > So the original problem is still present because both hyper-threads
> > in a core are likely to have an IRQ assigned.
> >
> > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > IRQ to every hyper-thread may be the only choice. If that's the
> > case, maybe this just isn't a good example to illustrate the
> > original problem and solution. But even with a better example
> > where the # of IRQs is <= half the # of CPUs in a NUMA node,
> > I don't think the code below accomplishes the original intent.
> >
> > Maybe I've missed something along the way in getting to this
> > version of the patch. Please feel free to set me straight. :-)
> >
> > Michael
>
> I have the same question as Michael. Also, I'm asking Souradeep
> in another channel: So, the algorithm still uses up all current
> NUMA node before moving on to the next NUMA node, right?
>
> Except each IRQ is affinitized to 2 CPUs.
> For example, a system with 2 IRQs:
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0-1
> 1       1       1       2-3
>
> Is this performing better than the algorithm in earlier patches? like below:
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0
> 1       1       1       2
>
The details of this approach have been shared by Yury later in this thread.
The main intention of this approach is that the kernel may pick either
sibling of the core for the IRQ.

> Thanks,
> - Haiyang
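
For anyone who wants to play with the distribution outside the kernel, the
tables quoted above can be modelled in user space roughly as below. This is
only a sketch of the heuristic as it appears in the example tables, not the
driver code: TOPOLOGY and spread_irqs() are made-up names, the node numbers
follow the topology diagram (0 and 1), and the model assumes each IRQ gets
the full sibling mask of one core and that a node is used up before the next
one is touched.

```python
from typing import Dict, List

# Hypothetical topology taken from the example in this thread:
# node -> core -> CPUs (the hyper-thread siblings of that core).
TOPOLOGY: Dict[int, Dict[int, List[int]]] = {
    0: {0: [0, 1], 1: [2, 3]},
    1: {2: [4, 5], 3: [6, 7]},
}


def spread_irqs(topology: Dict[int, Dict[int, List[int]]],
                nr_irqs: int) -> List[List[int]]:
    """Return one affinity mask (list of sibling CPUs) per IRQ.

    Illustrative model only: cores of a node are visited round-robin,
    each visit hands the whole sibling mask to the next IRQ, and a node
    contributes one slot per CPU before the next node is used.
    If nr_irqs exceeds the number of CPUs, fewer masks are returned.
    """
    masks: List[List[int]] = []
    for node in sorted(topology):
        cores = topology[node]
        # One pass per sibling, so the node supplies one slot per CPU.
        passes = max(len(cpus) for cpus in cores.values())
        for p in range(passes):
            for core in sorted(cores):
                if len(masks) == nr_irqs:
                    return masks
                if p < len(cores[core]):
                    masks.append(list(cores[core]))
    return masks


if __name__ == "__main__":
    for irq, mask in enumerate(spread_irqs(TOPOLOGY, 8)):
        print(irq, mask)
```

With 2 IRQs the model yields the masks 0-1 and 2-3, i.e. the first table in
Haiyang's question; which sibling inside each mask actually services the IRQ
is left to the kernel and is not modelled here.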