> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>> Can you please add after the cpumask_copy() in that mlx5 code:
>>>
>>>    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>>>    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);
>>
>> Both are 0000 0000 0000 0fff, as expected on a system
>> where 12 CPUs are present.
>
> So the non-initialized mask on stack has the online bits correctly
> copied and bits 12-63 are cleared, which is what cpumask_copy()
> achieves by copying longs and not bits.
>
>> [   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8
>
> How can that end up with completely different content here?
>
> As I said before, that's clearly a direct map address.
>
> So the call chain is:
>
>    mlx5_irq_alloc(af_desc)
>      pci_msix_alloc_irq_at(af_desc)
>        msi_domain_alloc_irq_at(af_desc)
>          __msi_domain_alloc_irqs(af_desc)
> 1)         msidesc->affinity = kmemdup(af_desc);
>            __irq_domain_alloc_irqs()
>              __irq_domain_alloc_irqs(aff=msidesc->affinity)
>                irq_domain_alloc_irqs_locked(aff)
>                  irq_domain_alloc_irqs_locked(aff)
>                    irq_domain_alloc_descs(aff)
>                      alloc_desc(mask=&aff->mask)
>                        desc_smp_init(mask)
> 2)                       cpumask_copy(desc->irq_common_data.affinity, mask);
>                  irq_domain_alloc_irqs_hierarchy()
>                    msi_domain_alloc()
>                      intel_irq_remapping_alloc()
>                        x86_vector_alloc_irqs()

It is in x86_vector_alloc_irqs() that the struct irq_data is fabricated
which ends up in irq_matrix_reserve_managed().

>                          reserve_managed_vector()
>                            mask = desc->irq_common_data.affinity;
>                            irq_matrix_reserve_managed(mask)
>
> So af_desc is kmemdup'ed at #1 and then the result is copied at #2.
>
> Anything else just hands pointers around. So where does either af_desc
> or msidesc->affinity or desc->irq_common_data.affinity get overwritten,
> or one of the pointers mangled?
> I doubt that it's the latter, as this is 99%
> generic code, which would end up in random failures all over the place.
>
> This also ends up in the wrong place. That mlx code does:
>
>      af_desc.is_managed = false;
>
> but the allocation ends up allocating a managed vector.

That line was changed in 6.4-rc4 to address another bug, and it avoids
this crash by not calling into the misbehaving code. It doesn't address
the mlx5_core initialization issue, though: as I said before, execution
continues and crashes in a similar scenario later on. On my system,
I've reverted that fix:

-       af_desc.is_managed = false;
+       af_desc.is_managed = 1;

so that we can continue debugging this crash.

> Can you please instrument this along the call chain so we can see where,
> or at least when, this gets corrupted? Please print the relevant pointer
> addresses too, so we can see whether that's consistent or not.

I will continue working on this today.

>> The lowest 16 bits of MASKBITS are bcd8, or in binary:
>>
>>    ... 1011 1100 1101 1000
>>
>> Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
>> 12, matching the CPU IDs from the loop. At bit 12, we fault,
>> since there is no CPU 12 on the system.
>
> That's due to a dubious optimization from Linus:
>
>    #if NR_CPUS <= BITS_PER_LONG
>    #define small_cpumask_bits ((unsigned int)NR_CPUS)
>    #define large_cpumask_bits ((unsigned int)NR_CPUS)
>    #elif NR_CPUS <= 4*BITS_PER_LONG
>    #define small_cpumask_bits nr_cpu_ids
>
> small_cpumask_bits is not nr_cpu_ids (12), it's NR_CPUS (32), which is
> why the loop does not terminate there. Bah!
>
> But that's just the symptom, not the root cause. That code is perfectly
> fine when all callers use the proper cpumask functions.

Agreed: we're crashing here because of the extra bits in the affinity
mask, but those bits should not be set in the first place. I wasn't
sure whether for_each_cpu() was supposed to iterate into non-present
CPUs -- and I guess the answer is yes: it will iterate the full length
of the mask.
The caller is responsible for ensuring the mask is valid.

--
Chuck Lever