Re: system hang on start-up (mlx5?)

On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> Can you please add after the cpumask_copy() in that mlx5 code:
>> 
>>    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>>    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);
>
> Both are 0000 0000 0000 0fff, as expected on a system
> where 12 CPUs are present.

So the uninitialized mask on the stack has the online bits correctly
copied and bits 12-63 cleared, which is what cpumask_copy() guarantees
because it copies whole longs, not individual bits.

> [   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8

How can that end up with completely different content here?

As I said before, that's clearly a direct-map address.

So the call chain is:

mlx5_irq_alloc(af_desc)
  pci_msix_alloc_irq_at(af_desc)
    msi_domain_alloc_irq_at(af_desc)
      __msi_domain_alloc_irqs(af_desc)
1)      msidesc->affinity = kmemdup(af_desc);
        __irq_domain_alloc_irqs()
          __irq_domain_alloc_irqs(aff=msidesc->affinity)
            irq_domain_alloc_irqs_locked(aff)
                irq_domain_alloc_descs(aff)
                  alloc_desc(mask=&aff->mask)
                    desc_smp_init(mask)
2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
                irq_domain_alloc_irqs_hierarchy()
                  msi_domain_alloc()
                    intel_irq_remapping_alloc()
                      x86_vector_alloc_irqs()
                        reserve_managed_vector()
                          mask = desc->irq_common_data.affinity;
                          irq_matrix_reserve_managed(mask)

So af_desc is kmemdup'ed at #1 and then the result is copied in #2.

Anything else just hands pointers around. So where does af_desc,
msidesc->affinity, or desc->irq_common_data.affinity get overwritten?
Or one of the pointers mangled? I doubt it's the latter, as this is 99%
generic code, which would end up in random failures all over the place.

This also ends up in the wrong place. That mlx5 code does:

   af_desc.is_managed = false;

but the allocation ends up allocating a managed vector.

This screams memory corruption ....

Can you please instrument this along the call chain so we can see where
or at least when this gets corrupted? Please print the relevant pointer
addresses too so we can see whether that's consistent or not.

> The lowest 16 bits of MASKBITS are bcd8, or in binary:
>
> ... 1011 1100 1101 1000
>
> Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
> 12, matching the CPU IDs from the loop. At bit 12, we fault,
> since there is no CPU 12 on the system.

That's due to a dubious optimization from Linus:

#if NR_CPUS <= BITS_PER_LONG
  #define small_cpumask_bits ((unsigned int)NR_CPUS)
  #define large_cpumask_bits ((unsigned int)NR_CPUS)
#elif NR_CPUS <= 4*BITS_PER_LONG
  #define small_cpumask_bits nr_cpu_ids

small_cpumask_bits is not nr_cpu_ids (12), it's NR_CPUS (32), which is
why the loop does not terminate at nr_cpu_ids. Bah!

But that's just the symptom, not the root cause. That code is perfectly
fine when all callers use the proper cpumask functions.

Thanks,

        tglx
