On Thu, 23 Apr 2020, Sebastian Andrzej Siewior wrote: > The IRQ4 is edge and in charge of ttyS0. It is handled by > handle_edge_irq() and after ->irq_ack(), the thread is woken up and then > we get another ->handle_edge_irq() for IRQ4. With larger PASS_LIMIT the > thread runs longer so note_interrupt() will make less IRQ_HANDLED based > on ->threads_handled_last. If it observes 100 handled within 100000 > interrupts then the counters are reset again. On !RT it usually manages > to get >100 per 100000 interrupts so it appears good. On RT it gets less > and the interrupt gets disabled. > > So it is not RT related, but RT triggers it more reliably (also the > PASS_LIMIT change can vanish). I can't tell if this is a qemu bug in > emulating the HW or not. I can't reproduce it real HW. I see a second > edge interrupt only after the thread completed. I can't tell if this is > because it is a real UART and the data is flowing slower or because the > edge-IRQ is not triggered repeatedly. Yeah, it's all strange. In the hope of understanding the issue a little bit better, I tried to disable IRQs in serial8250_handle_irq() to mimic what !PREEMPT_RT spinlock would do; the idea was that this is some kind of strange race / memory ordering (missed ack?) between the threaded irq4 handler and the do_IRQ() -> handle_edge_irq() -> ... path. So I did this: --- drivers/tty/serial/8250/8250_port.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c index e31217e8dce6..1a577305e174 100644 --- a/drivers/tty/serial/8250/8250_port.c +++ b/drivers/tty/serial/8250/8250_port.c @@ -1813,12 +1813,13 @@ static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir) int serial8250_handle_irq(struct uart_port *port, unsigned int iir) { unsigned char status; - unsigned long flags; + unsigned long flags, f; struct uart_8250_port *up = up_to_u8250p(port); if (iir & UART_IIR_NO_INT) return 0; + local_irq_save(f); spin_lock_irqsave(&port->lock, flags); status = serial_port_in(port, UART_LSR); @@ -1833,6 +1834,7 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir) serial8250_tx_chars(up); uart_unlock_and_check_sysrq(port, flags); + local_irq_restore(f); return 1; } But curiously enough, that exploded in the oposite order (so first there was CPU stall, and only later the disabling of IRQ4 due to spurious storm): [ 107.308525] 000: rcu: INFO: rcu_preempt self-detected stall on CPU [ 107.308542] 000: rcu: 0-....: (20985 ticks this GP) idle=b2a/1/0x4000000000000002 softirq=0/0 fqs=4841 [ 107.308563] 000: (t=21000 jiffies g=16217 q=22343) [ 107.308565] 000: NMI backtrace for cpu 0 [ 107.308570] 000: CPU: 0 PID: 1213 Comm: irq/4-ttyS0 Not tainted 5.6.4-rt19+ #11 [ 107.308577] 000: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 [ 107.308579] 000: Call Trace: [ 107.308599] 000: <IRQ> [ 107.308608] 000: dump_stack+0x50/0x6b [ 107.308630] 000: nmi_cpu_backtrace+0x8a/0x90 [ 107.308639] 000: ? lapic_can_unplug_cpu+0xd0/0xd0 [ 107.308644] 000: nmi_trigger_cpumask_backtrace+0xe9/0x130 [ 107.308647] 000: rcu_dump_cpu_stacks+0x8e/0xbc [ 107.308658] 000: rcu_sched_clock_irq+0x70a/0xad0 [ 107.308661] 000: ? cpuacct_account_field+0x16/0x70 [ 107.308667] 000: ? tick_sched_do_timer+0x60/0x60 [ 107.308675] 000: update_process_times+0x1f/0x40 [ 107.308680] 000: tick_sched_handle+0x2f/0x40 [ 107.308685] 000: tick_sched_timer+0x3e/0x80 [ 107.308687] 000: __hrtimer_run_queues+0x11e/0x3c0 [ 107.308691] 000: hrtimer_interrupt+0xea/0x250 [ 107.308694] 000: smp_apic_timer_interrupt+0x84/0x200 [ 107.308704] 000: apic_timer_interrupt+0xf/0x20 [ 107.308709] 000: </IRQ> [ 107.308713] 000: RIP: 0010:serial8250_handle_irq.part.26+0x9e/0xd0 [ 107.308719] 000: Code: 48 85 c0 74 09 80 b8 e9 00 00 00 00 74 0f 41 83 e5 20 74 09 f6 83 36 02 00 00 02 75 2d 31 f6 48 89 df e8 65 87 ff ff 41 54 9d <b8> 01 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 40 08 48 89 df e8 [ 107.308723] 000: RSP: 0018:ffffb6ffc0a63e20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 107.308726] 000: RAX: ffff90b23734e040 RBX: ffffffff9f5a0de0 RCX: 0000000000000000 [ 107.308728] 000: RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffffff9f5a0de0 [ 107.308729] 000: RBP: 0000000000000060 R08: ffffffff9ea0b7a0 R09: ffffb6ffc0a63d78 [ 107.308730] 000: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000246 [ 107.308731] 000: R13: 0000000000000020 R14: 00000000000000c2 R15: ffff90b2ba711e58 [ 107.308739] 000: serial8250_default_handle_irq+0x45/0x50 [ 107.308741] 000: serial8250_interrupt+0x51/0xa0 [ 107.308749] 000: ? irq_finalize_oneshot.part.46+0xd0/0xd0 [ 107.308755] 000: irq_forced_thread_fn+0x2b/0x70 [ 107.308758] 000: irq_thread+0xd8/0x170 [ 107.308760] 000: ? wake_threads_waitq+0x30/0x30 [ 107.308761] 000: ? irq_thread_dtor+0x90/0x90 [ 107.308763] 000: kthread+0x10e/0x130 [ 107.308769] 000: ? kthread_park+0x80/0x80 [ 107.308770] 000: ret_from_fork+0x35/0x40 [ 165.360370] 000: irq 4: nobody cared (try booting with the "irqpoll" option) [ 165.360392] 000: CPU: 0 PID: 1213 Comm: irq/4-ttyS0 Not tainted 5.6.4-rt19+ #11 [ 165.360398] 000: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 [ 165.360599] 000: Call Trace: [ 165.361639] 000: <IRQ> [ 165.361651] 000: dump_stack+0x50/0x6b [ 165.361670] 000: __report_bad_irq+0x2b/0xb0 [ 165.361678] 000: note_interrupt+0x22e/0x280 [ 165.361680] 000: ? serial8250_handle_irq.part.26+0x9e/0xd0 [ 165.361685] 000: handle_irq_event_percpu+0x6e/0x90 [ 165.361693] 000: handle_irq_event+0x48/0x90 [ 165.361695] 000: handle_edge_irq+0x95/0x1f0 [ 165.361700] 000: do_IRQ+0x6c/0x120 [ 165.361709] 000: common_interrupt+0xf/0xf [ 165.361719] 000: </IRQ> [ 165.361720] 000: RIP: 0010:serial8250_handle_irq.part.26+0x9e/0xd0 [ 165.361727] 000: Code: 48 85 c0 74 09 80 b8 e9 00 00 00 00 74 0f 41 83 e5 20 74 09 f6 83 36 02 00 00 02 75 2d 31 f6 48 89 df e8 65 87 ff ff 41 54 9d <b8> 01 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 40 08 48 89 df e8 [ 165.361731] 000: RSP: 0018:ffffb6ffc0a63e20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb [ 165.361732] 000: RAX: ffff90b23734e040 RBX: ffffffff9f5a0de0 RCX: 0000000000000001 [ 165.361736] 000: RDX: 0000000000000003 RSI: 0000000000000097 RDI: 00000000ffffffff [ 165.361737] 000: RBP: 0000000000000060 R08: ffffffff9ea0b7a0 R09: ffffb6ffc0a63d78 [ 165.361739] 000: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000246 [ 165.361740] 000: R13: 0000000000000020 R14: 00000000000000c2 R15: ffff90b2ba711e58 [ 165.361749] 000: serial8250_default_handle_irq+0x45/0x50 [ 165.361752] 000: serial8250_interrupt+0x51/0xa0 [ 165.361759] 000: ? irq_finalize_oneshot.part.46+0xd0/0xd0 [ 165.361762] 000: irq_forced_thread_fn+0x2b/0x70 [ 165.361767] 000: irq_thread+0xd8/0x170 [ 165.361771] 000: ? wake_threads_waitq+0x30/0x30 [ 165.361773] 000: ? irq_thread_dtor+0x90/0x90 [ 165.361775] 000: kthread+0x10e/0x130 [ 165.361780] 000: ? kthread_park+0x80/0x80 [ 165.361782] 000: ret_from_fork+0x35/0x40 [ 165.361787] 000: handlers: [ 165.361793] 000: [<00000000005f10ab>] irq_default_primary_handler threaded [<00000000853a4a7e>] serial8250_interrupt [ 165.361795] 000: Disabling IRQ #4 So this time the storm + stall apparently was triggered by the 'popf' coming from my added local_irq_restore(). It's really curious. I'd be happy to blame it on some qemu breakage, but it seems to be happening on any randomly chosen version so far. Thanks, -- Jiri Kosina SUSE Labs