Re: system hang on start-up (mlx5?)

Shay Drory <shayd@xxxxxxxxxx> · Tue, 30 May 2023 18:08:21 +0300

On 30/05/2023 16:54, Eli Cohen wrote:
-----Original Message-----
From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
Sent: Tuesday, 30 May 2023 16:51
To: Eli Cohen <elic@xxxxxxxxxx>
Cc: Shay Drory <shayd@xxxxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>;
Saeed Mahameed <saeedm@xxxxxxxxxx>; linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>;
open list:NETWORKING [GENERAL] <netdev@xxxxxxxxxxxxxxx>;
Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Subject: Re: system hang on start-up (mlx5?)



On May 30, 2023, at 9:48 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:

From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
Sent: Tuesday, 30 May 2023 16:28
To: Eli Cohen <elic@xxxxxxxxxx>
Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Saeed Mahameed
<saeedm@xxxxxxxxxx>; linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>; open
list:NETWORKING [GENERAL] <netdev@xxxxxxxxxxxxxxx>; Thomas Gleixner
<tglx@xxxxxxxxxxxxx>
Subject: Re: system hang on start-up (mlx5?)



On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx>
wrote:
On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx>
wrote:
On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
On May 7, 2023, at 1:31 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:
I can boot the system with mlx5_core deny-listed. I log in, remove
mlx5_core from the deny list, and then "modprobe mlx5_core" to
reproduce the issue while the system is running.

May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
###

The fault address is the cm->managed_map for one of the CPUs.
That does not make any sense at all. The irq matrix is initialized via:

irq_alloc_matrix()
    m = kzalloc(sizeof(*m), GFP_KERNEL);
    m->maps = alloc_percpu(*m->maps);

So how is any per CPU map which got allocated there supposed to be
invalid (not mapped):

May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
But if you look at the address: 0xffffffffb9ef3f80

That one is bogus:

   managed_map=ffff9a36efcf0f80
   managed_map=ffff9a36efd30f80
   managed_map=ffff9a3aefc30f80
   managed_map=ffff9a3aefc70f80
   managed_map=ffff9a3aefd30f80
   managed_map=ffff9a3aefd70f80
   managed_map=ffffffffb9ef3f80

Can you spot the fail?

The first six are in the direct map and the last one is in module map,
which makes no sense at all.
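
The outlier in the list above can be picked out mechanically. What follows
is a minimal userspace sketch, not kernel code. It assumes the documented
x86-64 layout in which the top 2 GiB of the address space
(0xffffffff80000000 and up) holds the kernel image and module mappings,
plus a few fixed mappings at the very top. The helper names
in_image_or_module_area() and count_outliers() are made up for this
illustration. The sketch deliberately does not try to separate kernel image
from module space, or to pin down the direct-map base, because both depend
on configuration such as KASLR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed boundary: start of the top 2 GiB on x86-64. */
#define TOP_2G_START 0xffffffff80000000ULL

/* Hypothetical helper: true if addr is in the kernel image/module area. */
static bool in_image_or_module_area(uint64_t addr)
{
	return addr >= TOP_2G_START;
}

/* managed_map values copied from the list in this thread. */
static const uint64_t managed_maps[] = {
	0xffff9a36efcf0f80ULL,
	0xffff9a36efd30f80ULL,
	0xffff9a3aefc30f80ULL,
	0xffff9a3aefc70f80ULL,
	0xffff9a3aefd30f80ULL,
	0xffff9a3aefd70f80ULL,
	0xffffffffb9ef3f80ULL,
};

/* Count how many of the n addresses fall in the top 2 GiB. */
static size_t count_outliers(const uint64_t *addrs, size_t n)
{
	size_t i, outliers = 0;

	assert(addrs != NULL);
	for (i = 0; i < n; i++)
		if (in_image_or_module_area(addrs[i]))
			outliers++;
	return outliers;
}
```

Calling count_outliers(managed_maps, 7) flags only the last entry, which is
the address the page fault reports.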
Indeed. The reason for that is that the affinity mask has bits
set for CPU IDs that are not present on my system.

After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
that mask is set up like this:

struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
{
       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
-       cpumask_var_t req_mask;
+       struct irq_affinity_desc af_desc;
       struct mlx5_irq *irq;
-       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
-               return ERR_PTR(-ENOMEM);
-       cpumask_copy(req_mask, cpu_online_mask);
+       cpumask_copy(&af_desc.mask, cpu_online_mask);
+       af_desc.is_managed = false;
By the way, why is "is_managed" set to false?

This particular system is a NUMA system, and I'd like to be
able to set IRQ affinity for the card. Since is_managed is
set to false, writing to the /proc/irq files fails with EIO.

This is a control irq and is used for issuing configuration commands.

This commit:
commit c410abbbacb9b378365ba17a30df08b4b9eec64f
Author: Dou Liyang <douliyangs@xxxxxxxxx>
Date:   Tue Dec 4 23:51:21 2018 +0800

    genirq/affinity: Add is_managed to struct irq_affinity_desc

explains why it should not be managed.
Understood, but what about the other IRQs? I can't set any
of them. All writes to the proc files result in EIO.

I think @Shay Drory has a fix for that which should go upstream.
Shay, was it sent?

The fix was sent and merged.

https://lore.kernel.org/all/20230523054242.21596-15-saeed@xxxxxxxxxx/

Which normally works as you would expect. But for some historical
reason, I have CONFIG_NR_CPUS=32 on my system, and the
cpumask_copy() misbehaves.

If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
copy, this crash goes away. But mlx5_core crashes during a later
part of its init, in cpu_rmap_update(). cpu_rmap_update() does
exactly the same thing (for_each_cpu() on an affinity mask created
by copying), and crashes in a very similar fashion.

If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
vanishes entirely, and "modprobe mlx5_core" works as expected.

Thus I think the problem is with cpumask_copy() or for_each_cpu()
when NR_CPUS is a small value (the default is 8192).
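
To make the suspected failure mode concrete, here is a toy userspace model.
It is only a sketch of the hypothesis stated above, not the kernel's
cpumask implementation: struct toy_mask, the toy_*() helpers and
absent_after_copy() are all made up, and whether cpumask_copy() really
writes fewer bits than for_each_cpu() later scans is exactly the open
question in this thread. The model fills an on-stack mask with a fixed
pattern standing in for stack garbage, copies only the bits for the CPUs
that exist, and then scans the full mask width. The caller chooses the
present-CPU count; the thread does not state the real one.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MASK_BITS 32U	/* models CONFIG_NR_CPUS=32 from the thread */

/* Toy stand-in for a cpumask: one bit per CPU ID. Not the kernel type. */
struct toy_mask {
	unsigned char bits[MASK_BITS / CHAR_BIT];
};

static int toy_test(const struct toy_mask *m, unsigned int cpu)
{
	return (m->bits[cpu / CHAR_BIT] >> (cpu % CHAR_BIT)) & 1;
}

static void toy_assign(struct toy_mask *m, unsigned int cpu, int val)
{
	unsigned char bit = (unsigned char)(1U << (cpu % CHAR_BIT));

	if (val)
		m->bits[cpu / CHAR_BIT] |= bit;
	else
		m->bits[cpu / CHAR_BIT] &= (unsigned char)~bit;
}

/* Copy only the first nbits bits; higher bits keep their old value. */
static void toy_copy_partial(struct toy_mask *dst, const struct toy_mask *src,
			     unsigned int nbits)
{
	unsigned int cpu;

	assert(nbits <= MASK_BITS);
	for (cpu = 0; cpu < nbits; cpu++)
		toy_assign(dst, cpu, toy_test(src, cpu));
}

/* Count set bits at or above nr_present, i.e. CPU IDs that do not exist. */
static unsigned int toy_count_absent(const struct toy_mask *m,
				     unsigned int nr_present)
{
	unsigned int cpu, n = 0;

	for (cpu = nr_present; cpu < MASK_BITS; cpu++)
		n += (unsigned int)toy_test(m, cpu);
	return n;
}

/* Build a mask the way a caller with an on-stack descriptor would. */
static unsigned int absent_after_copy(int clear_first, unsigned int nr_present)
{
	struct toy_mask online, dst;
	unsigned int cpu;

	memset(&online, 0, sizeof(online));
	for (cpu = 0; cpu < nr_present; cpu++)
		toy_assign(&online, cpu, 1);

	memset(&dst, 0xa5, sizeof(dst));	/* stand-in for stack garbage */
	if (clear_first)
		memset(&dst, 0, sizeof(dst));
	toy_copy_partial(&dst, &online, nr_present);
	return toy_count_absent(&dst, nr_present);
}
```

With clear_first set, the scan finds no bits for absent CPUs. Without it,
stale bits survive the copy, and a loop over the full mask would visit CPU
IDs that have no valid per-CPU area. That matches the observation that
clearing @af_desc before the copy makes the first crash go away.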


Can you please apply the debug patch below and provide the output?

Thanks,

      tglx
---

--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -51,6 +51,7 @@ struct irq_matrix {
 					   unsigned int alloc_end)
 {
 	struct irq_matrix *m;
+	unsigned int cpu;

 	if (matrix_bits > IRQ_MATRIX_BITS)
 		return NULL;
@@ -68,6 +69,8 @@ struct irq_matrix {
 		kfree(m);
 		return NULL;
 	}
+	for_each_possible_cpu(cpu)
+		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
 	return m;
 }

@@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
 		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
 		unsigned int bit;

+		pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
+
 		bit = matrix_alloc_area(m, cm, 1, true);
 		if (bit >= m->alloc_end)
 			goto cleanup;
--
Chuck Lever
