> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> 
>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> 
>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@xxxxxxxxxx> wrote:
>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>> reproduce the issue while the system is running.
>>> 
>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>> 
>>> ###
>>> 
>>> The fault address is the cm->managed_map for one of the CPUs.
>> 
>> That does not make any sense at all. The irq matrix is initialized via:
>> 
>> irq_alloc_matrix()
>>   m = kzalloc(sizeof(matrix));
>>   m->maps = alloc_percpu(*m->maps);
>> 
>> So how is any per-CPU map which got allocated there supposed to be
>> invalid (not mapped):
>> 
>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>> 
>> But if you look at the address: 0xffffffffb9ef3f80
>> 
>> That one is bogus:
>> 
>> managed_map=ffff9a36efcf0f80
>> managed_map=ffff9a36efd30f80
>> managed_map=ffff9a3aefc30f80
>> managed_map=ffff9a3aefc70f80
>> managed_map=ffff9a3aefd30f80
>> managed_map=ffff9a3aefd70f80
>> managed_map=ffffffffb9ef3f80
>> 
>> Can you spot the fail?
>> 
>> The first six are in the direct map and the last one is in the module
>> map, which makes no sense at all.
> 
> Indeed. The reason for that is that the affinity mask has bits
> set for CPU IDs that are not present on my system.
> 
> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
> that mask is set up like this:
> 
> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
> {
> 	struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
> -	cpumask_var_t req_mask;
> +	struct irq_affinity_desc af_desc;
> 	struct mlx5_irq *irq;
> 
> -	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> -		return ERR_PTR(-ENOMEM);
> -	cpumask_copy(req_mask, cpu_online_mask);
> +	cpumask_copy(&af_desc.mask, cpu_online_mask);
> +	af_desc.is_managed = false;

By the way, why is "is_managed" set to false? This particular system
is a NUMA system, and I'd like to be able to set IRQ affinity for the
card. Since is_managed is set to false, writing to the /proc/irq files
fails with EIO.
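Also, for anyone trying to reproduce this: the "clear @af_desc before
the copy" change mentioned below is nothing exotic. A sketch of the
idea (not necessarily the exact hunk that was tested) is simply to make
the on-stack descriptor start out fully zeroed before cpumask_copy()
writes into it:

	struct irq_affinity_desc af_desc = { };	/* mask and is_managed both start at zero */

	cpumask_copy(&af_desc.mask, cpu_online_mask);
	af_desc.is_managed = false;

An explicit memset(&af_desc, 0, sizeof(af_desc)) or a
cpumask_clear(&af_desc.mask) ahead of the copy would serve the same
purpose; the point is only that no stale stack bits survive in
af_desc.mask.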
> Which normally works as you would expect. But for some historical
> reason, I have CONFIG_NR_CPUS=32 on my system, and the cpumask_copy()
> misbehaves.
> 
> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> copy, this crash goes away. But mlx5_core crashes during a later
> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
> exactly the same thing (for_each_cpu() on an affinity mask created
> by copying), and crashes in a very similar fashion.
> 
> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
> vanishes entirely, and "modprobe mlx5_core" works as expected.
> 
> Thus I think the problem is with cpumask_copy() or for_each_cpu()
> when NR_CPUS is a small value (the default is 8192).
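To connect this back to the bogus 0xffffffffb9ef3f80 above:
irq_matrix_reserve_managed() walks the caller's mask and takes a
per-CPU pointer for every set bit, roughly like this (paraphrased from
memory; the real code is visible in the debug patch below):

	for_each_cpu(cpu, msk) {
		/* 'cm' is a wild pointer if 'cpu' was never set up */
		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
		unsigned int bit;

		/* faults while reading cm->managed_map */
		bit = matrix_alloc_area(m, cm, 1, true);
		if (bit >= m->alloc_end)
			goto cleanup;
		...
	}

So a stray bit for a CPU ID that does not exist on this system produces
a per-CPU address that alloc_percpu() never handed out; my guess is
that the per-CPU offset for such a CPU still points at the initial
per-CPU area in the kernel image, which would explain why the last
managed_map value lands in the module map rather than the direct map.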
>> Can you please apply the debug patch below and provide the output?
>> 
>> Thanks,
>> 
>>        tglx
>> ---
>> --- a/kernel/irq/matrix.c
>> +++ b/kernel/irq/matrix.c
>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>  			   unsigned int alloc_end)
>>  {
>>  	struct irq_matrix *m;
>> +	unsigned int cpu;
>>  
>>  	if (matrix_bits > IRQ_MATRIX_BITS)
>>  		return NULL;
>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>  		kfree(m);
>>  		return NULL;
>>  	}
>> +	for_each_possible_cpu(cpu)
>> +		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>>  	return m;
>>  }
>>  
>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>  	struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>  	unsigned int bit;
>>  
>> +	pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>> +
>>  	bit = matrix_alloc_area(m, cm, 1, true);
>>  	if (bit >= m->alloc_end)
>>  		goto cleanup;
> 
> --
> Chuck Lever

--
Chuck Lever