On Mon, Nov 18 2024 at 22:15, Leonardo Bras wrote:
> On Thu, Nov 14, 2024 at 09:50:36AM +0200, Andy Shevchenko wrote:
>> - the IRQ core disables IRQ while handling an IRQ number in question;
>
> Not necessarily:
> When irqs are force-threaded, only a quick handler is called, returning
> IRQ_WAKE_THREAD, which is supposed to wake up the handler thread.
>
> * @IRQ_WAKE_THREAD: handler requests to wake the handler thread
>
> In this case (which is what I am dealing with), the actual handler will
> run in thread context (which I suppose doesn't disable IRQs for
> sched-out purposes).

Let's talk about the disable-irq concepts here. There are three distinct
variants:

 1) Disable interrupts on the local CPU: local_irq_disable/save()

    This only prevents the CPU from handling a pending interrupt, but
    does not prevent new interrupts from being marked pending in the
    interrupt controller.

 2) Disable interrupts in the interrupt controller

    The disable_irq*() variants handle that. They come in two flavors:
    LAZY and UNLAZY.

    LAZY does not mask the interrupt line right away. The line is only
    masked when an interrupt arrives while it is marked "disabled".
    This then sets the PENDING bit, which in turn causes a replay of
    the interrupt once enable_irq() brings the disabled count down to
    zero.

    LAZY has two purposes:

      A) Prevent losing edge type interrupts

      B) Optimization to avoid the potentially costly hardware access

    UNLAZY masks the interrupt right away.

 3) Disable interrupts at the device level

    This prevents the device from raising interrupts.

Now there is another twist with force-threaded interrupts. If the
interrupt is level triggered, the interrupt line is masked in the hard
interrupt handler and unmasked when the thread returns.

For edge type interrupts that's handled differently in order to not
lose an edge: the line is not masked, so that if the device raises an
interrupt before the threaded handler returns, the threaded handler is
marked runnable again.
That again comes in two flavors:

 1) Non-RT kernel

    That cannot result in a scenario where the force-threaded handler
    loops and interrupts keep arriving at the CPU, because the CPU has
    interrupts disabled across the force-threaded handler, and at least
    on x86 and arm64 that's the CPU which is the target of the hardware
    interrupt.

    So at max a few interrupts can come in between the first interrupt
    and the threaded handler being scheduled in, but I doubt it can be
    more than three.

 2) RT kernel

    The force-threaded handler runs with CPU interrupts enabled, so
    when the threaded handler deals with the device, new interrupts can
    come in at the hardware level and are accounted for, which is what
    you are trying to address, right?

    So the total amount of TX bytes, i.e. PASS_LIMIT * tx_loadsz,
    becomes large enough to trigger the irq storm detector, right?

That obviously begs two questions:

 1) Why do you want to deal with this interrupt flood at the storm
    detector and not at the root cause of the whole thing?

    It does not make any sense to take a gazillion of interrupts for
    nothing and then work around the resulting detector fallout.

    The obvious adhoc cure is to add this to serial8250_interrupt():

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		disable_irq_nosync(irq);

	magic_loop();

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		enable_irq(irq);

    Ideally we'd disable interrupt generation in the IER register
    around the loop, but that's another can of worms as this can't be
    done easily without holding the port lock across the IER-disabled
    section. Also see below.

 2) Is the 512 iterations loop (PASS_LIMIT) still appropriate?

    That loop in serial8250_interrupt() looks pretty horrible even on a
    non-RT kernel with an emulated UART.

    The issue is mostly irrelevant on real hardware as the FIFO writes
    are faster than the UART can empty the FIFO. So unless there are
    two FIFO UARTs sharing the same interrupt line _and_ writing large
    amounts of data at the same time, the loop will terminate after one
    write cycle.
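As a quick sanity check of that PASS_LIMIT * tx_loadsz bound, assuming
PASS_LIMIT = 512 (the value in the 8250 core) and a 16-byte
16550A-style FIFO, i.e. tx_loadsz = 16:

```python
# Worst-case TX bytes pushed out by one invocation of the interrupt
# handler loop. PASS_LIMIT = 512 and tx_loadsz = 16 are assumptions
# (the 8250 core's limit and a classic 16550A FIFO depth).
PASS_LIMIT = 512
tx_loadsz = 16

worst_case_tx_bytes = PASS_LIMIT * tx_loadsz
print(worst_case_tx_bytes)  # 8192 bytes in a single handler invocation
```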
There won't be an interrupt per TX byte either.

On emulated hardware this is very different because the write causes a
VMEXIT with a full round trip to user space, which consumes the
character and immediately raises the TX empty interrupt again because
there is no "slow" hardware involved.

With the default qemu FIFO depth of 16 bytes, this results in up to
512 * 16 = 8192 bytes written out with interrupts disabled for one
invocation of the interrupt handler (threaded or not)...

I just traced interrupt entry/exit in a VM, and for a file with 256k
bytes written to /dev/ttyS0 this results in:

    Interrupts:         35
    Bytes/Interrupt:    ~7490
    Tmax:               104725 us
    Tavg:               96778 us

So one interrupt handler invocation which writes up to 8k takes roughly
100ms with interrupts disabled...

Now looking at the host side. For every write to the TX FIFO there are
_four_ invocations of kvm_set_irq() originating from
kvm_vm_ioctl_irq_line():

 CPU 36/KVM-2063 [097] ..... 1466862.728737: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xd
 CPU 36/KVM-2063 [097] ..... 1466862.728742: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728749: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728754: kvm_set_irq: gsi 4 level 1 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728762: kvm_set_irq: gsi 4 level 1 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728783: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xa
 CPU 36/KVM-2063 [097] ..... 1466862.728787: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728792: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728797: kvm_set_irq: gsi 4 level 1 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728809: kvm_set_irq: gsi 4 level 1 source 0

No idea why this needs four ioctls and not only two, but this clearly
demonstrates that each byte written creates an edge interrupt. No
surprise that the storm detector triggers on RT.
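The Bytes/Interrupt figure from that trace can be cross-checked from
the other two numbers (assuming "256k" means 256 * 1024 bytes):

```python
# Cross-check of the trace summary above: 256 KiB written, 35
# interrupt handler invocations observed.
total_bytes = 256 * 1024
interrupts = 35

bytes_per_interrupt = total_bytes / interrupts
print(round(bytes_per_interrupt))  # ~7490, matching the trace summary
```

That is close to the 8192-byte worst case, i.e. nearly every handler
invocation runs the loop to the PASS_LIMIT-sized maximum.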
But this also makes it clear why it's a patently bad idea to work
around this in the detector...

Now being "smart" and disabling THRI in the IER before the write loop
in serial8250_tx_chars() changes the picture to:

 CPU 18/KVM-2045 [092] ..... 1466230.357478: kvm_pio: pio_write at 0x3f9 size 1 count 1 val 0x5

IER.THRI is cleared now, so it's expected that nothing fiddles with the
interrupt line anymore, right? But ....

 CPU 18/KVM-2045 [092] ..... 1466230.357479: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357481: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357484: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0x73
 CPU 18/KVM-2045 [092] ..... 1466230.357485: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357487: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357489: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357491: kvm_set_irq: gsi 4 level 0 source 0
 ....

So every write to the TX FIFO sets the level to 0 four times in a row
to not create an edge. That's truly impressive!

Thanks,

        tglx