On Fri, Mar 24, 2023 at 05:01:28PM +0000, Catalin Marinas wrote: > On Fri, Mar 24, 2023 at 08:43:38AM +0000, Bouska, Zdenek wrote: > > I have seen ~3 ms delay in interrupt handling on ARM64. > > > > I have traced it down to raw_spin_lock() call in handle_irq_event() in > > kernel/irq/handle.c: > > > > irqreturn_t handle_irq_event(struct irq_desc *desc) > > { > > irqreturn_t ret; > > > > desc->istate &= ~IRQS_PENDING; > > irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS); > > raw_spin_unlock(&desc->lock); > > > > ret = handle_irq_event_percpu(desc); > > > > --> raw_spin_lock(&desc->lock); > > irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS); > > return ret; > > } > > > > It took ~3 ms for this raw_spin_lock() to lock. > > That's quite a large indeed. > > > During this time irq_finalize_oneshot() from kernel/irq/manage.c locks and > > unlocks the same raw spin lock more than 1000 times: > > > > static void irq_finalize_oneshot(struct irq_desc *desc, > > struct irqaction *action) > > { > > if (!(desc->istate & IRQS_ONESHOT) || > > action->handler == irq_forced_secondary_handler) > > return; > > again: > > chip_bus_lock(desc); > > --> raw_spin_lock_irq(&desc->lock); > > > > /* > > * Implausible though it may be we need to protect us against > > * the following scenario: > > * > > * The thread is faster done than the hard interrupt handler > > * on the other CPU. If we unmask the irq line then the > > * interrupt can come in again and masks the line, leaves due > > * to IRQS_INPROGRESS and the irq line is masked forever. > > * > > * This also serializes the state of shared oneshot handlers > > * versus "desc->threads_oneshot |= action->thread_mask;" in > > * irq_wake_thread(). See the comment there which explains the > > * serialization. > > */ > > if (unlikely(irqd_irq_inprogress(&desc->irq_data))) { > > --> raw_spin_unlock_irq(&desc->lock); > > chip_bus_sync_unlock(desc); > > cpu_relax(); > > goto again; > > } > > So this path is hammering the desc->lock location and another CPU cannot > change it. As you found, the problem is not the spinlock algorithm but > the atomic primitives. The LDXR/STXR constructs on arm64 are known to > have this issue with STXR failing indefinitely. raw_spin_unlock() simply > does an STLR and this clears the exclusive monitor that the other CPU > may have set with LDXR but before the STXR. The queued spinlock only > provides fairness if the CPU manages to get in the queue. > > > So I confirmed that atomic operations from > > arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are > > contested from second CPU. > > > > Do you think that it is possible to create fair qspinlock implementation > > on top of atomic instructions supported by ARM64 version 8 (no LSE atomic > > instructions) without compromising performance in the uncontested case? > > For example ARM64 could have custom queued_fetch_set_pending_acquire > > implementation same as x86 has in arch/x86/include/asm/qspinlock.h. Is the > > retry loop in irq_finalize_oneshot() ok together with the current ARM64 > > cpu_relax() implementation for processor with no LSE atomic instructions? > > So is the queued_fetch_set_pending_acquire() where it gets stuck or the > earlier atomic_try_cmpxchg_acquire() before entering on the slow path? I > guess both can fail in a similar way. > > A longer cpu_relax() here would improve things (on arm64 this function > is a no-op) but maybe Thomas or Will have a better idea. I had a pretty gross cpu_relax() implementation using wfe somewhere on LKML, so you could try that if you can dig it up. Generally though, LDXR/STXR and realtime don't mix super well. Will