RE: system hang on start-up (mlx5?)

Eli Cohen <elic@xxxxxxxxxx> · Tue, 30 May 2023 13:54:17 +0000



> -----Original Message-----
> From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
> Sent: Tuesday, 30 May 2023 16:51
> To: Eli Cohen <elic@xxxxxxxxxx>
> Cc: Shay Drory <shayd@xxxxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>;
> Saeed Mahameed <saeedm@xxxxxxxxxx>; linux-rdma
> <linux-rdma@xxxxxxxxxxxxxxx>; open list:NETWORKING [GENERAL]
> <netdev@xxxxxxxxxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Subject: Re: system hang on start-up (mlx5?)
> 
> 
> 
> > On May 30, 2023, at 9:48 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:
> >
> >> From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
> >> Sent: Tuesday, 30 May 2023 16:28
> >> To: Eli Cohen <elic@xxxxxxxxxx>
> >> Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Saeed Mahameed
> >> <saeedm@xxxxxxxxxx>; linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>; open
> >> list:NETWORKING [GENERAL] <netdev@xxxxxxxxxxxxxxx>; Thomas Gleixner
> >> <tglx@xxxxxxxxxxxxx>
> >> Subject: Re: system hang on start-up (mlx5?)
> >>
> >>
> >>
> >>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> >>>
> >>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
> >>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:
> >>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
> >>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
> >>>>> reproduce the issue while the system is running.
> >>>>>
> >>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
> >>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
> >>>>>
> >>>>> ###
> >>>>>
> >>>>> The fault address is the cm->managed_map for one of the CPUs.
> >>>>
> >>>> That does not make any sense at all. The irq matrix is initialized via:
> >>>>
> >>>> irq_alloc_matrix()
> >>>> m = kzalloc(sizeof(*m), GFP_KERNEL);
> >>>> m->maps = alloc_percpu(*m->maps);
> >>>>
> >>>> So how is any per CPU map which got allocated there supposed to be
> >>>> invalid (not mapped):
> >>>>
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
> >>>>
> >>>> But if you look at the address: 0xffffffffb9ef3f80
> >>>>
> >>>> That one is bogus:
> >>>>
> >>>>   managed_map=ffff9a36efcf0f80
> >>>>   managed_map=ffff9a36efd30f80
> >>>>   managed_map=ffff9a3aefc30f80
> >>>>   managed_map=ffff9a3aefc70f80
> >>>>   managed_map=ffff9a3aefd30f80
> >>>>   managed_map=ffff9a3aefd70f80
> >>>>   managed_map=ffffffffb9ef3f80
> >>>>
> >>>> Can you spot the fail?
> >>>>
> >>>> The first six are in the direct map and the last one is in the module
> >>>> map, which makes no sense at all.
> >>>
> >>> Indeed. The reason for that is that the affinity mask has bits
> >>> set for CPU IDs that are not present on my system.
> >>>
> >>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
> >>> that mask is set up like this:
> >>>
> >>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
> >>> {
> >>>       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
> >>> -       cpumask_var_t req_mask;
> >>> +       struct irq_affinity_desc af_desc;
> >>>       struct mlx5_irq *irq;
> >>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> >>> -               return ERR_PTR(-ENOMEM);
> >>> -       cpumask_copy(req_mask, cpu_online_mask);
> >>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
> >>> +       af_desc.is_managed = false;
> >>
> >> By the way, why is "is_managed" set to false?
> >>
> >> This particular system is a NUMA system, and I'd like to be
> >> able to set IRQ affinity for the card. Since is_managed is
> >> set to false, writing to the /proc/irq files fails with EIO.
> >>
> > This is a control irq and is used for issuing configuration commands.
> >
> > This commit:
> > commit c410abbbacb9b378365ba17a30df08b4b9eec64f
> > Author: Dou Liyang <douliyangs@xxxxxxxxx>
> > Date:   Tue Dec 4 23:51:21 2018 +0800
> >
> >    genirq/affinity: Add is_managed to struct irq_affinity_desc
> >
> > explains why it should not be managed.
> 
> Understood, but what about the other IRQs? I can't set any
> of them. All writes to the proc files result in EIO.
> 
I think @Shay Drory has a fix for that which should go upstream.
Shay, was it sent?
> 
> >>> Which normally works as you would expect. But for some historical
> >>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
> >>> cpumask_copy() misbehaves.
> >>>
> >>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> >>> copy, this crash goes away. But mlx5_core crashes during a later
> >>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
> >>> exactly the same thing (for_each_cpu() on an affinity mask created
> >>> by copying), and crashes in a very similar fashion.
> >>>
> >>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
> >>> vanishes entirely, and "modprobe mlx5_core" works as expected.
> >>>
> >>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
> >>> when NR_CPUS is a small value (the default is 8192).
> >>>
> >>>
> >>>> Can you please apply the debug patch below and provide the output?
> >>>>
> >>>> Thanks,
> >>>>
> >>>>      tglx
> >>>> ---
> >>>> --- a/kernel/irq/matrix.c
> >>>> +++ b/kernel/irq/matrix.c
> >>>> @@ -51,6 +51,7 @@ struct irq_matrix {
> >>>> unsigned int alloc_end)
> >>>> {
> >>>> struct irq_matrix *m;
> >>>> + unsigned int cpu;
> >>>>
> >>>> if (matrix_bits > IRQ_MATRIX_BITS)
> >>>> return NULL;
> >>>> @@ -68,6 +69,8 @@ struct irq_matrix {
> >>>> kfree(m);
> >>>> return NULL;
> >>>> }
> >>>> + for_each_possible_cpu(cpu)
> >>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
> >>>> return m;
> >>>> }
> >>>>
> >>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
> >>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
> >>>> unsigned int bit;
> >>>>
> >>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
> >>>> +
> >>>> bit = matrix_alloc_area(m, cm, 1, true);
> >>>> if (bit >= m->alloc_end)
> >>>> goto cleanup;
> >>>
> >>> --
> >>> Chuck Lever
> >>
> >>
> >> --
> >> Chuck Lever
> 
> 
> --
> Chuck Lever
>