> On May 30, 2023, at 11:08 AM, Shay Drory <shayd@xxxxxxxxxx> wrote:
>
>
> On 30/05/2023 16:54, Eli Cohen wrote:
>>> -----Original Message-----
>>> From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
>>> Sent: Tuesday, 30 May 2023 16:51
>>> To: Eli Cohen <elic@xxxxxxxxxx>
>>> Cc: Shay Drory <shayd@xxxxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>; Saeed Mahameed <saeedm@xxxxxxxxxx>; linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>; open list:NETWORKING [GENERAL] <netdev@xxxxxxxxxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>> Subject: Re: system hang on start-up (mlx5?)
>>>
>>>
>>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:
>>>>
>>>>> From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
>>>>> Sent: Tuesday, 30 May 2023 16:28
>>>>> To: Eli Cohen <elic@xxxxxxxxxx>
>>>>> Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Saeed Mahameed <saeedm@xxxxxxxxxx>; linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>; open list:NETWORKING [GENERAL] <netdev@xxxxxxxxxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>>
>>>>>
>>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:
>>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>>> reproduce the issue while the system is running.
>>>>>>>>
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>>> ###
>>>>>>>>
>>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>>> That does not make any sense at all.
>>>>>>> The irq matrix is initialized via:
>>>>>>>
>>>>>>>     irq_alloc_matrix()
>>>>>>>         m = kzalloc(sizeof(matrix));
>>>>>>>         m->maps = alloc_percpu(*m->maps);
>>>>>>>
>>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>>> invalid (not mapped):
>>>>>>>
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>>
>>>>>>> That one is bogus:
>>>>>>>
>>>>>>> managed_map=ffff9a36efcf0f80
>>>>>>> managed_map=ffff9a36efd30f80
>>>>>>> managed_map=ffff9a3aefc30f80
>>>>>>> managed_map=ffff9a3aefc70f80
>>>>>>> managed_map=ffff9a3aefd30f80
>>>>>>> managed_map=ffff9a3aefd70f80
>>>>>>> managed_map=ffffffffb9ef3f80
>>>>>>>
>>>>>>> Can you spot the fail?
>>>>>>>
>>>>>>> The first six are in the direct map and the last one is in the module map,
>>>>>>> which makes no sense at all.
>>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>>> set for CPU IDs that are not present on my system.
>>>>>>
>>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>>> that mask is set up like this:
>>>>>>
>>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>>> {
>>>>>> 	struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>>> -	cpumask_var_t req_mask;
>>>>>> +	struct irq_affinity_desc af_desc;
>>>>>> 	struct mlx5_irq *irq;
>>>>>> -	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>>> -		return ERR_PTR(-ENOMEM);
>>>>>> -	cpumask_copy(req_mask, cpu_online_mask);
>>>>>> +	cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>>> +	af_desc.is_managed = false;
>>>>> By the way, why is "is_managed" set to false?
>>>>>
>>>>> This particular system is a NUMA system, and I'd like to be
>>>>> able to set IRQ affinity for the card. Since is_managed is
>>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>>
>>>> This is a control irq and is used for issuing configuration commands.
>>>>
>>>> This commit:
>>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>>> Author: Dou Liyang <douliyangs@xxxxxxxxx>
>>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>>
>>>>     genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>>
>>>> explains why it should not be managed.
>>> Understood, but what about the other IRQs? I can't set any
>>> of them. All writes to the proc files result in EIO.
>>>
>> I think @Shay Drory has a fix for that which should go upstream.
>> Shay, was it sent?
>
> The fix was sent and merged.
>
> https://lore.kernel.org/all/20230523054242.21596-15-saeed@xxxxxxxxxx/

Fwiw, I'm now on v6.4-rc4, and setting IRQ affinity works as expected.
Sorry for the noise and thanks for the fix.

>>>>>> Which normally works as you would expect. But for some historical
>>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>>> cpumask_copy() misbehaves.
>>>>>>
>>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>>> part of its init, in cpu_rmap_update().
>>>>>> cpu_rmap_update() does exactly the same thing (for_each_cpu() on
>>>>>> an affinity mask created by copying), and crashes in a very
>>>>>> similar fashion.
>>>>>>
>>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>>
>>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>>> when NR_CPUS is a small value (the default is 8192).
>>>>>>
>>>>>>
>>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>         tglx
>>>>>>> ---
>>>>>>> --- a/kernel/irq/matrix.c
>>>>>>> +++ b/kernel/irq/matrix.c
>>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>>>  			   unsigned int alloc_end)
>>>>>>>  {
>>>>>>>  	struct irq_matrix *m;
>>>>>>> +	unsigned int cpu;
>>>>>>>
>>>>>>>  	if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>>>  		return NULL;
>>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>>>  		kfree(m);
>>>>>>>  		return NULL;
>>>>>>>  	}
>>>>>>> +	for_each_possible_cpu(cpu)
>>>>>>> +		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>>>>>>>  	return m;
>>>>>>>  }
>>>>>>>
>>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>>>  		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>>>  		unsigned int bit;
>>>>>>>
>>>>>>> +		pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>>>>>>> +
>>>>>>>  		bit = matrix_alloc_area(m, cm, 1, true);
>>>>>>>  		if (bit >= m->alloc_end)
>>>>>>>  			goto cleanup;
>>>>>> --
>>>>>> Chuck Lever
>>>>>
>>>>> --
>>>>> Chuck Lever
>>>
>>> --
>>> Chuck Lever

--
Chuck Lever
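
As an aside for readers following Thomas's analysis above: the sketch below is a minimal, self-contained userspace simulation (not kernel code and not from the thread) of why a stray bit above nr_cpu_ids in an affinity mask turns into a wild cm pointer like the bogus 0xffffffffb9ef3f80 managed_map address. The per-CPU maps are allocated only for CPUs that can actually exist, so an out-of-range CPU number yields an address outside that allocation. The struct layout, CPU counts, and the position of the bogus bit are invented for illustration, and plain array arithmetic stands in for the kernel's real per_cpu_ptr() offset math.

#include <stdint.h>
#include <stdio.h>

/* Loosely modeled on the per-CPU cpumap in kernel/irq/matrix.c. */
struct cpumap {
	unsigned long managed_map;
	unsigned long alloc_map;
};

#define NR_PRESENT_CPUS 6	/* CPUs whose maps really exist (assumption) */
static struct cpumap maps[NR_PRESENT_CPUS];

int main(void)
{
	/* Affinity mask with CPUs 0-5 set plus one bogus high bit (CPU 30). */
	unsigned long mask = 0x3fUL | (1UL << 30);
	uintptr_t base = (uintptr_t)maps;
	unsigned int cpu;

	for (cpu = 0; cpu < 8 * sizeof(mask); cpu++) {
		if (!(mask & (1UL << cpu)))
			continue;
		/* Stand-in for per_cpu_ptr(m->maps, cpu): base + offset. */
		uintptr_t cm = base + (uintptr_t)cpu * sizeof(struct cpumap);

		printf("RESERVE MANAGED: CPU%03u: %#lx%s\n", cpu,
		       (unsigned long)cm,
		       cpu >= NR_PRESENT_CPUS ? "  <-- outside the allocation" : "");
	}
	return 0;
}

Running it prints six in-range cm addresses followed by one that lies beyond the array, mirroring the pattern in the managed_map list quoted above.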