RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> · Fri, 12 Jan 2024 18:30:44 +0000

> -----Original Message-----
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> Sent: Friday, January 12, 2024 11:37 AM
> To: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> Cc: Yury Norov <yury.norov@xxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>;
> Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
> <decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
> kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>;
> leon@xxxxxxxxxx; cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx;
> vkuznets@xxxxxxxxxx; tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx;
> netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
> rdma@xxxxxxxxxxxxxxx; Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>;
> Paul Rosswurm <paulros@xxxxxxxxxxxxx>
> Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread
> IRQs per CPUs
> 
> From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent:
> Wednesday, January 10, 2024 10:13 PM
> >
> > The test topology used to compare the performance of
> > cpu_local_spread() and the new approach is:
> > Case 1
> > IRQ     Nodes  Cores CPUs
> > 0       1      0     0-1
> > 1       1      1     2-3
> > 2       1      2     4-5
> > 3       1      3     6-7
> >
> > and with the existing cpu_local_spread():
> > Case 2
> > IRQ    Nodes  Cores CPUs
> > 0      1      0     0
> > 1      1      0     1
> > 2      1      1     2
> > 3      1      1     3
> >
> > A total of 4 channels were used, set up with ethtool.
> > With ntttcp, case 1 gave 15 percent better performance than
> > case 2. irqbalance was disabled during the test.
> >
> > Also, you are right: on a 64-CPU system this approach will spread
> > the IRQs like cpu_local_spread() does, but in the future we will offer
> > MANA nodes with more than 64 CPUs. There, this new design will
> > give better performance.
> >
> > I will add these performance benefit details to the commit message of
> > the next version.
> 
> Here are my concerns:
> 
> 1.  The most commonly used VMs these days have 64 or fewer
> vCPUs and won't see any performance benefit.
> 
> 2.  Larger VMs probably won't see the full 15% benefit because
> all vCPUs in the local NUMA node will be assigned IRQs.  For
> example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> vCPUs in NUMA node 0 will be assigned IRQs.  The remaining
> 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> in a way that avoids sharing a core.  But overall this means
> that 75% of the IRQs will still be sharing a core and
> presumably not see any perf benefit.
> 
> 3.  Your experiment was on a relatively small scale:   4 IRQs
> spread across 2 cores vs. across 4 cores.  Have you run any
> experiments on VMs with 128 vCPUs (for example) where
> most of the IRQs are not sharing a core?  I'm wondering if
> the results with 4 IRQs really scale up to 64 IRQs.  A lot can
> be different in a VM with 64 cores and 2 NUMA nodes vs.
> 4 cores in a single node.
> 
> 4.  The new algorithm prefers assigning to all vCPUs in
> each NUMA hop over assigning to separate cores.  Are there
> experiments showing that is the right tradeoff?  What
> are the results if assigning to separate cores is preferred?

I remember that in a customer case, putting the IRQs on the same
NUMA node gave better perf. But I agree, this should be re-tested
on the MANA NIC.
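
To make the arithmetic in concern 2 concrete, here is a rough Python
sketch of the placement as described in concerns 2 and 4. It is not the
patch code, only a simplified model: each NUMA hop receives as many IRQs
as it has vCPUs, cycling over its cores, before the next hop is used.
The function names are hypothetical, and the model assumes 2 hardware
threads per core and equally sized NUMA nodes.

```python
from collections import Counter


def spread_irqs(num_irqs, cpus_per_node, num_nodes, smt=2):
    """Return one core id per IRQ under a simplified placement model.

    Assumption (not taken from the patch): every node has the same
    number of vCPUs and every core has `smt` hardware threads.
    """
    cores_per_node = cpus_per_node // smt
    assignment = []
    for node in range(num_nodes):
        # A node takes up to `cpus_per_node` IRQs. The first pass over
        # its cores gives each core one IRQ; later passes reuse cores.
        for slot in range(cpus_per_node):
            if len(assignment) == num_irqs:
                return assignment
            core = node * cores_per_node + slot % cores_per_node
            assignment.append(core)
    return assignment


def shared_fraction(assignment):
    """Fraction of IRQs placed on a core that holds more than one IRQ."""
    counts = Counter(assignment)
    shared = sum(1 for core in assignment if counts[core] > 1)
    return shared / len(assignment)


if __name__ == "__main__":
    # 96 vCPUs, 2 NUMA nodes, 64 IRQs: the example from concern 2.
    print(shared_fraction(spread_irqs(64, 48, 2)))
    # 8 vCPUs, 1 NUMA node, 4 IRQs: the Case 1 topology.
    print(spread_irqs(4, 8, 1))
```

Under these assumptions the 96-vCPU example puts 48 of the 64 IRQs on
shared cores, matching the 75% figure above, while the 4-IRQ Case 1
topology places every IRQ on its own core.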

- Haiyang