Re: [PATCH 6.10 000/809] 6.10.3-rc3 review

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Mon, 05 Aug 2024 10:56:01 +0200

On Sun, Aug 04 2024 at 20:28, Guenter Roeck wrote:
> On 8/4/24 11:36, Guenter Roeck wrote:
>>> Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>>>      genirq: Set IRQF_COND_ONESHOT in request_irq()
>>>
>> 
>> With this patch in v6.10.3, all my parisc64 qemu tests get stuck with repeated error messages
>> 
>> [    0.000000] =============================================================================
>> [    0.000000] BUG kmem_cache_node (Not tainted): objects 21 > max 16
>> [    0.000000] -----------------------------------------------------------------------------

Do you have a full boot log? It's unclear to me at which point of the boot
process this happens. Is this before or after the secondary CPUs have
been brought up?

>> This never stops until the emulation aborts.

Do you have a recipe for reproducing this?

>> Reverting this patch fixes the problem for me.
>> 
>> I noticed a similar problem in the mainline kernel but it is either spurious there
>> or the problem has been fixed.
>> 
>
> As a follow-up, the patch below (on top of v6.10.3) "fixes" the problem for me.
> I guess that suggests some kind of race condition.
>
>
> @@ -2156,6 +2157,8 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
>          struct irq_desc *desc;
>          int retval;
>
> +       udelay(1);
> +
>          if (irq == IRQ_NOTCONNECTED)
>                  return -ENOTCONN;

That all makes absolutely no sense to me.

IRQF_COND_ONESHOT only has an effect on shared interrupts, and only when
the interrupt was already requested with IRQF_ONESHOT.

If this is really a race then the following must be true:

1) no delay

   CPU0                                 CPU1
   request_irq(IRQF_ONESHOT)
                                        request_irq(IRQF_COND_ONESHOT)

2) delay

   CPU0                                 CPU1
                                        request_irq(IRQF_COND_ONESHOT)
   request_irq(IRQF_ONESHOT)

   In this case the request on CPU0 fails with -EBUSY, because the
   interrupt was already requested without IRQF_ONESHOT and the flags
   no longer match ...
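
To make the two orderings concrete, here is a minimal userspace sketch of
the flag check as I understand it. This is a simplified model, not the
kernel code: the MODEL_* names, the flag values and both helper functions
are made up for illustration, and trigger type checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative flag bits for this model; only their distinctness matters. */
#define MODEL_IRQF_SHARED        0x0080u
#define MODEL_IRQF_ONESHOT       0x2000u
#define MODEL_IRQF_COND_ONESHOT  0x40000u

/*
 * Simplified model of the check applied when a second handler is added
 * to an interrupt line that already has a handler (hypothetical helper).
 *
 * old_flags: flags of the handler that is already installed
 * new_flags: flags of the incoming request, may gain ONESHOT
 *
 * Returns 0 if the line can be shared, -EBUSY on a flag mismatch.
 */
static int model_share_check(unsigned int old_flags, unsigned int *new_flags)
{
	assert(new_flags != NULL);

	/* Both sides must agree to share the line. */
	if (!((old_flags & *new_flags) & MODEL_IRQF_SHARED))
		return -EBUSY;

	/* COND_ONESHOT adopts ONESHOT only if the existing handler has it. */
	if ((old_flags & MODEL_IRQF_ONESHOT) &&
	    (*new_flags & MODEL_IRQF_COND_ONESHOT))
		*new_flags |= MODEL_IRQF_ONESHOT;
	else if ((old_flags ^ *new_flags) & MODEL_IRQF_ONESHOT)
		return -EBUSY;	/* one side is ONESHOT, the other is not */

	return 0;
}

/* Install 'first', then request 'second'; return the second result. */
static int model_request_pair(unsigned int first, unsigned int second)
{
	return model_share_check(first, &second);
}
```

In this model, requesting SHARED | ONESHOT first and SHARED | COND_ONESHOT
second succeeds (case 1), while the reverse order returns -EBUSY for the
second request (case 2).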

Confused

        tglx