> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote: > > Chuck! > > On Tue, May 30 2023 at 13:09, Chuck Lever III wrote: >>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote: >>> But if you look at the address: 0xffffffffb9ef3f80 >>> >>> That one is bogus: >>> >>> managed_map=ffff9a36efcf0f80 >>> managed_map=ffff9a36efd30f80 >>> managed_map=ffff9a3aefc30f80 >>> managed_map=ffff9a3aefc70f80 >>> managed_map=ffff9a3aefd30f80 >>> managed_map=ffff9a3aefd70f80 >>> managed_map=ffffffffb9ef3f80 >>> >>> Can you spot the fail? >>> >>> The first six are in the direct map and the last one is in module map, >>> which makes no sense at all. >> >> Indeed. The reason for that is that the affinity mask has bits >> set for CPU IDs that are not present on my system. > > Which I don't buy, but even if so then this should not make > for_each_cpu() go south. See below. > >> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor") >> that mask is set up like this: >> >> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev) >> { >> struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev); >> - cpumask_var_t req_mask; >> + struct irq_affinity_desc af_desc; > > That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack... > >> struct mlx5_irq *irq; >> - if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL)) >> - return ERR_PTR(-ENOMEM); >> - cpumask_copy(req_mask, cpu_online_mask); >> + cpumask_copy(&af_desc.mask, cpu_online_mask); >> + af_desc.is_managed = false; >> >> Which normally works as you would expect. But for some historical >> reason, I have CONFIG_NR_CPUS=32 on my system, and the >> cpumask_copy() misbehaves. >> >> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the >> copy, this crash goes away. But mlx5_core crashes during a later >> part of its init, in cpu_rmap_update(). cpu_rmap_update() does >> exactly the same thing (for_each_cpu() on an affinity mask created >> by copying), and crashes in a very similar fashion. >> >> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem >> vanishes entirely, and "modprobe mlx5_core" works as expected. >> >> Thus I think the problem is with cpumask_copy() or for_each_cpu() >> when NR_CPUS is a small value (the default is 8192). > > I don't buy any of this. Here is why: > > cpumask_copy(d, s) > bitmap_copy(d, s, nbits = 32) > len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); > > So it copies as many longs as required to cover nbits, i.e. it copies > any clobbered bits beyond nbits too. While that looks odd at the first > glance, that's just an optimization which is harmless. > > for_each_cpu() finds the next set bit in a mask and breaks the loop once > bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too. > > I just booted a kernel with NR_CPUS=32: My system has only 12 CPUs. So every bit in your mask represents a present CPU, but on my system, only 0x00000fff are ever present. Therefore, on my system, any bit higher than bit 11 in a CPU mask will reference a CPU that is not present. > [ 0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32 > [ 0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs > ... > [ 0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1 > > and added a function which does: > > struct cpumask ma, mb; > int cpu; > > memset(&ma, 0xaa, sizeof(ma); > cpumask_copy(&mb, &ma); > pr_info("MASKBITS: %016lx\n", mb.bits[0]); > pr_info("CPUs:"); > for_each_cpu(cpu, &mb) > pr_cont(" %d", cpu); > pr_cont("\n"); > > [ 2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa > [ 2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 > > and the same exercise with a 0x55 filled source bitmap. > > [ 2.167595] smp: MASKBITS: 0x5555555555555555 > [ 2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 > > So while cpumask_copy copied beyond NR_CPUs bits, for_each_cpu() does > the right thing simple because of this: > > nr_cpu_ids is 32, right? > > for_each_cpu(cpu, mask) > for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids) > for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++) > > So if find_next_bit() returns a bit after bit 31 the condition (bit) < > (size) will terminate the loop, right? Again, you are assuming more CPUs than there are bits in the mask. > Also in the case of that driver the copy is _NOT_ copying set bits > beyond bit 31 simply because the source mask is cpu_online_mask which > definitely does not have a bit set which is greater than 31. As the copy > copies longs the resulting mask in af_desc.mask cannot have any bit set > past bit 31 either. > > If the above is not correct, then there is a bigger problem than that > MLX5 driver crashing. > > So this: > >> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the >> copy, this crash goes away. > > does not make any sense to me. > > Can you please add after the cpumask_copy() in that mlx5 code: > > pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]); > pr_info("MASKBITS: %016lx\n", af_desc.mask.bits[0]); Both are 0000 0000 0000 0fff, as expected on a system where 12 CPUs are present. > Please print also in irq_matrix_reserve_managed(): > > - @mask->bits[0] > - nr_cpu_ids > - the CPU numbers inside the for_each_cpu() loop Here's where it gets interesting: + pr_info("%s: MASKBITS: %016lx\n", __func__, msk->bits[0]); + pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids); [ 70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000 [ 70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link) [ 71.260555][ C9] port_module: 1 callbacks suppressed [ 71.260561][ C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged [ 71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8 [ 71.274096][ T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W). [ 71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12 [ 71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3 [ 71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4 [ 71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6 [ 71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7 [ 71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10 [ 71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11 [ 71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12 Notice that there are in fact higher bits set than bit 11. The lowest 16 bits of MASKBITS are bcd8, or in binary: ... 1011 1100 1101 1000 Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and 12, matching the CPU IDs from the loop. At bit 12, we fault, since there is no CPU 12 on the system. -- Chuck Lever