Re: [PATCH v5 3/3] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints

Tariq Toukan <ttoukan.linux@xxxxxxxxx> · Mon, 24 Oct 2022 14:24:58 +0300

On 10/21/2022 3:19 PM, Valentin Schneider wrote:
From: Tariq Toukan <tariqt@xxxxxxxxxx>

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_cpu() API that minds the
actual distances, so that remote NUMAs with short distance are preferred
over farther ones.

This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).

[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
    int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.

* CPU util on active cores only.

Setups details (similar for both sides):

NIC: ConnectX6-DX dual port, 100 Gbps each.
Single port used in the tests.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        16
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2594.804
BogoMIPS:            4890.73
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,128-135
NUMA node1 CPU(s):   8-15,136-143
NUMA node2 CPU(s):   16-23,144-151
NUMA node3 CPU(s):   24-31,152-159
NUMA node4 CPU(s):   32-39,160-167
NUMA node5 CPU(s):   40-47,168-175
NUMA node6 CPU(s):   48-55,176-183
NUMA node7 CPU(s):   56-63,184-191
NUMA node8 CPU(s):   64-71,192-199
NUMA node9 CPU(s):   72-79,200-207
NUMA node10 CPU(s):  80-87,208-215
NUMA node11 CPU(s):  88-95,216-223
NUMA node12 CPU(s):  96-103,224-231
NUMA node13 CPU(s):  104-111,232-239
NUMA node14 CPU(s):  112-119,240-247
NUMA node15 CPU(s):  120-127,248-255
..
...

Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>
[Tweaked API use]

Thanks for your modification.
It looks good to me.

Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>

Signed-off-by: Valentin Schneider <vschneid@xxxxxxxxxx>
---
  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 ++++++++++++++++--
  1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index a0242dc15741c..7acbeb3d51846 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -812,9 +812,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
  static int comp_irqs_request(struct mlx5_core_dev *dev)
  {
  	struct mlx5_eq_table *table = dev->priv.eq_table;
+	const struct cpumask *prev = cpu_none_mask;
+	const struct cpumask *mask;
  	int ncomp_eqs = table->num_comp_eqs;
  	u16 *cpus;
  	int ret;
+	int cpu;
  	int i;
  
  	ncomp_eqs = table->num_comp_eqs;
@@ -833,8 +836,19 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
  		ret = -ENOMEM;
  		goto free_irqs;
  	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	i = 0;
+	rcu_read_lock();
+	for_each_numa_hop_mask(mask, dev->priv.numa_node) {
+		for_each_cpu_andnot(cpu, mask, prev) {
+			cpus[i] = cpu;
+			if (++i == ncomp_eqs)
+				goto spread_done;
+		}
+		prev = mask;
+	}
+spread_done:
+	rcu_read_unlock();
  	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
  	kfree(cpus);
  	if (ret < 0)