Hello,

I am seeing a ~3 ms delay in interrupt handling on ARM64.

I have traced it down to the raw_spin_lock() call in handle_irq_event() in
kernel/irq/handle.c:

    irqreturn_t handle_irq_event(struct irq_desc *desc)
    {
        irqreturn_t ret;

        desc->istate &= ~IRQS_PENDING;
        irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
        raw_spin_unlock(&desc->lock);

        ret = handle_irq_event_percpu(desc);

-->     raw_spin_lock(&desc->lock);
        irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
        return ret;
    }

It took ~3 ms for this raw_spin_lock() to acquire the lock.

During this time irq_finalize_oneshot() from kernel/irq/manage.c locks and
unlocks the same raw spinlock more than 1000 times:

    static void irq_finalize_oneshot(struct irq_desc *desc,
                                     struct irqaction *action)
    {
        if (!(desc->istate & IRQS_ONESHOT) ||
            action->handler == irq_forced_secondary_handler)
            return;
    again:
        chip_bus_lock(desc);
-->     raw_spin_lock_irq(&desc->lock);

        /*
         * Implausible though it may be we need to protect us against
         * the following scenario:
         *
         * The thread is faster done than the hard interrupt handler
         * on the other CPU. If we unmask the irq line then the
         * interrupt can come in again and masks the line, leaves due
         * to IRQS_INPROGRESS and the irq line is masked forever.
         *
         * This also serializes the state of shared oneshot handlers
         * versus "desc->threads_oneshot |= action->thread_mask;" in
         * irq_wake_thread(). See the comment there which explains the
         * serialization.
         */
        if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
-->         raw_spin_unlock_irq(&desc->lock);
            chip_bus_sync_unlock(desc);
            cpu_relax();
            goto again;
        }
    ...

I have created a workaround for this problem: after 100 failed tries,
cpu_relax() is called 50 times. See the attached patch
3ms_tx_delay_workaround.patch.

I have also created a custom kernel module with 2 threads, one similar to
irq_finalize_oneshot() and the other similar to handle_irq_event(). I used
the latest Linux 6.3-rc3 with no added patches and confirmed that even there
qspinlocks are not fair on my ARM64 board.
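
The workaround mentioned above (extra cpu_relax() calls once 100 tries have failed) can be illustrated with a small userspace sketch. This is not the attached patch: relax_stub(), relax_after_failed_try() and simulate_retry_loop() are hypothetical names, the stub only counts calls instead of relaxing the CPU, and the choice to back off on every failed try beyond the 100th is an assumption about how the description should be read.

```c
#include <assert.h>
#include <stddef.h>

/* Thresholds taken from the description above. */
#define TRIES_BEFORE_BACKOFF 100
#define BACKOFF_RELAX_CALLS   50

/* Counts relax calls so the behaviour can be checked in userspace. */
static unsigned long relax_calls;

/* Hypothetical stand-in for the kernel's cpu_relax(). */
static void relax_stub(void)
{
	relax_calls++;
}

/*
 * Call once per failed attempt, in place of the single cpu_relax().
 * Assumption: every failed try beyond the first 100 backs off with
 * 50 extra relax calls; the attached patch may do this differently.
 */
static void relax_after_failed_try(unsigned int *failed_tries)
{
	assert(failed_tries != NULL);

	relax_stub();
	if (++(*failed_tries) > TRIES_BEFORE_BACKOFF) {
		for (int i = 0; i < BACKOFF_RELAX_CALLS; i++)
			relax_stub();
	}
}

/*
 * Simulates the "goto again" loop: the in-progress flag is seen set
 * for busy_polls iterations, then clear. Returns total relax calls.
 */
static unsigned long simulate_retry_loop(unsigned int busy_polls)
{
	unsigned int failed_tries = 0;

	relax_calls = 0;
	while (busy_polls > 0) {
		busy_polls--;
		relax_after_failed_try(&failed_tries);
	}
	return relax_calls;
}
```

For example, simulate_retry_loop(103) counts 103 regular relax calls plus 3 * 50 backoff calls. The intent of such a backoff is to leave the contended lock alone long enough for the other CPU to take it.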

I copied the qspinlock code into the module twice and put traces into only
one thread, the one which takes several ms to lock and which corresponds to
the call from handle_irq_event(). I found that
queued_fetch_set_pending_acquire() is what takes those 3 ms to finish. On
ARM64 queued_fetch_set_pending_acquire() is implemented as
atomic_fetch_or_acquire().

My CPU does not support the LSE atomic instructions, so the LL/SC
implementation is used, and it looks like atomic operations can be quite slow
there. The assembler code in arch/arm64/include/asm/atomic_ll_sc.h contains a
retry loop:

    #define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint) \
    static __always_inline int \
    __ll_sc_atomic_fetch_##op##name(int i, atomic_t *v) \
    { \
        unsigned long tmp; \
        int val, result; \
        \
        asm volatile("// atomic_fetch_" #op #name "\n" \
        " prfm pstl1strm, %3\n" \
        "1: ld" #acq "xr %w0, %3\n" \
        " " #asm_op " %w1, %w0, %w4\n" \
        " st" #rel "xr %w2, %w1, %3\n" \
-->     " cbnz %w2, 1b\n" \
        " " #mb \
        : "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter) \
        : __stringify(constraint) "r" (i) \
        : cl); \
        \
        return result; \
    }

Most importantly, these atomic operations seem to let one CPU dominate the
cache line so that the other CPU is unable to take the lock. That is
problematic in combination with the retry loop in irq_finalize_oneshot().

To confirm this I created a small userspace program which just calls
__ll_sc_atomic_fetch_or_acquire() from two threads. See the attached
unfair_arm64_asm_atomic_ll_sc_demonstration.tar.gz. Below you can see that a
single atomic operation took up to 16 ms.

    # ./contested
    load thread started
    evaluation thread started
    new max duration: 6420 ns
    new max duration: 9355 ns
    new max duration: 22240 ns
    new max duration: 23180 ns
    new max duration: 70465 ns
    new max duration: 77860 ns
    new max duration: 83100 ns
    new max duration: 105115 ns
    new max duration: 127695 ns
    new max duration: 128840 ns
    new max duration: 1265595 ns
    new max duration: 3713430 ns
    new max duration: 3750810 ns
    new max duration: 7996020 ns
    new max duration: 7998890 ns
    new max duration: 7999340 ns
    new max duration: 7999490 ns
    new max duration: 12000210 ns
    new max duration: 15999700 ns
    new max duration: 16000000 ns
    new max duration: 16000030 ns

So I confirmed that the atomic operations from
arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are
contended by a second CPU.

Do you think it is possible to create a fair qspinlock implementation on top
of the atomic instructions supported by ARMv8 without LSE atomic
instructions, without compromising performance in the uncontended case? For
example, ARM64 could have a custom queued_fetch_set_pending_acquire()
implementation, the same way x86 has one in
arch/x86/include/asm/qspinlock.h.

Is the retry loop in irq_finalize_oneshot() OK together with the current
ARM64 cpu_relax() implementation on a processor with no LSE atomic
instructions?

I have reproduced the real-life TX delay scenario only with the ICSSG network
driver (not yet merged to mainline) [1]. That was with kernel 5.10 with
patches, CONFIG_PREEMPT_RT and custom ICSSG firmware on a Texas Instruments
AM65x IDK [2] with ARM Cortex A53. This custom setup comes with a high
interrupt load.

[1] https://lore.kernel.org/all/20220406094358.7895-1-p-mohan@xxxxxx/
[2] https://www.ti.com/tool/TMDX654IDKEVM

With best regards,
Zdenek Bouska

--
Siemens, s.r.o
Siemens Advanta Development

Attachment: unfair_arm64_asm_atomic_ll_sc_demonstration.tar.gz
Attachment: 3ms_tx_delay_workaround.patch