On Mon, Nov 18 2024 at 22:15, Leonardo Bras wrote:
> On Thu, Nov 14, 2024 at 09:50:36AM +0200, Andy Shevchenko wrote:
>> - the IRQ core disables IRQ while handling an IRQ number in question;
>
> Not necessarily:
> When irqs are force-threaded, only a quick handler is called, returning
> IRQ_WAKE_THREAD, which is supposed to wake up the handler thread.
>
> * @IRQ_WAKE_THREAD: handler requests to wake the handler thread
>
> In this case (which is what I am dealing with), the actual handler will
> run in thread context (which I suppose doesn't disable IRQs for
> sched-out purposes).

Let's talk about the disable-irq concepts here. There are three distinct
variants:

 1) Disable interrupts on the local CPU: local_irq_disable/save()

    This only prevents the CPU from handling a pending interrupt, but
    does not prevent new interrupts from being marked pending in the
    interrupt controller.

 2) Disable interrupts in the interrupt controller

    The disable_irq*() variants handle that. They come in two flavors:
    LAZY and UNLAZY.

    LAZY does not mask the interrupt line right away. The line is only
    masked when an interrupt arrives while it is marked "disabled".
    This then sets the PENDING bit, which in turn causes a replay of
    the interrupt once enable_irq() brings the disabled count down to
    zero.

    LAZY has two purposes:

      A) Prevent losing edge type interrupts

      B) Optimization to avoid the potentially costly hardware access

    UNLAZY masks the interrupt right away.

 3) Disable interrupts at the device level

    This prevents the device from raising interrupts.

Now there is another twist with force-threaded interrupts. If the
interrupt is level triggered, the interrupt line is masked in the hard
interrupt handler and unmasked when the thread returns.

For edge type interrupts that's handled differently in order to not
lose an edge: the line is not masked, so that if the device raises an
interrupt before the threaded handler returns, the threaded handler is
marked runnable again.
That again comes in two flavors:

 1) Non-RT kernel

    That cannot result in a scenario where the force-threaded handler
    loops and interrupts keep arriving at the CPU, because the CPU has
    interrupts disabled across the force-threaded handler, and at least
    on x86 and arm64 that's the CPU which is the target of the hardware
    interrupt.

    So at max a few interrupts can come in between the first interrupt
    and the threaded handler being scheduled in, but I doubt it can be
    more than three.

 2) RT kernel

    The force-threaded handler runs with CPU interrupts enabled, so
    when the threaded handler deals with the device, new interrupts can
    come in at the hardware level and are accounted for, which is what
    you are trying to address, right?

    So the total amount of TX bytes, i.e. PASS_LIMIT * tx_loadsz,
    becomes large enough to trigger the irq storm detector, right?

That obviously begs two questions:

 1) Why do you want to deal with this interrupt flood at the storm
    detector and not at the root cause of the whole thing?

    It does not make any sense to take a gazillion of interrupts for
    nothing and then work around the resulting detector fallout.

    The obvious adhoc cure is to add this to serial8250_interrupt():

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		disable_irq_nosync(irq);

	magic_loop();

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		enable_irq(irq);

    Ideally we'd disable interrupt generation in the IER register
    around the loop, but that's another can of worms as this can't be
    done easily without holding the port lock across the IER-disabled
    section. Also see below.

 2) Is the 512 iterations loop (PASS_LIMIT) still appropriate?

    That loop in serial8250_interrupt() looks pretty horrible even on a
    non-RT kernel with an emulated UART.

    The issue is mostly irrelevant on real hardware as the FIFO writes
    are faster than the UART can empty the FIFO. So unless there are
    two FIFO UARTs sharing the same interrupt line _and_ writing large
    amounts of data at the same time, the loop will terminate after one
    write cycle.
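As a quick sanity check of that PASS_LIMIT * tx_loadsz bound, assuming
PASS_LIMIT = 512 (the value in the 8250 core) and a 16-byte
16550A-style FIFO, i.e. tx_loadsz = 16:

```python
# Worst-case TX bytes pushed out by one invocation of the interrupt
# handler loop. PASS_LIMIT = 512 and tx_loadsz = 16 are assumptions
# (the 8250 core's limit and a classic 16550A FIFO depth).
PASS_LIMIT = 512
tx_loadsz = 16

worst_case_tx_bytes = PASS_LIMIT * tx_loadsz
print(worst_case_tx_bytes)  # 8192 bytes in a single handler invocation
```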
There won't be an interrupt per TX byte either.

On emulated hardware this is very different because the write causes a
VMEXIT with a full round trip to user space, which consumes the
character and immediately raises the TX empty interrupt again because
there is no "slow" hardware involved.

With the default qemu FIFO depth of 16 bytes, this results in up to
512 * 16 = 8192 bytes written out with interrupts disabled for one
invocation of the interrupt handler (threaded or not)...

I just traced interrupt entry/exit in a VM, and for a file with 256k
bytes written to /dev/ttyS0 this results in:

    Interrupts:         35
    Bytes/Interrupt:    ~7490
    Tmax:               104725 us
    Tavg:               96778 us

So one interrupt handler invocation which writes up to 8k takes roughly
100ms with interrupts disabled...

Now looking at the host side. For every write to the TX FIFO there are
_four_ invocations of kvm_set_irq() originating from
kvm_vm_ioctl_irq_line():

 CPU 36/KVM-2063 [097] ..... 1466862.728737: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xd
 CPU 36/KVM-2063 [097] ..... 1466862.728742: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728749: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728754: kvm_set_irq: gsi 4 level 1 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728762: kvm_set_irq: gsi 4 level 1 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728783: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xa
 CPU 36/KVM-2063 [097] ..... 1466862.728787: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728792: kvm_set_irq: gsi 4 level 0 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728797: kvm_set_irq: gsi 4 level 1 source 0
 CPU 36/KVM-2063 [097] ..... 1466862.728809: kvm_set_irq: gsi 4 level 1 source 0

No idea why this needs four ioctls and not only two, but this clearly
demonstrates that each byte written creates an edge interrupt. No
surprise that the storm detector triggers on RT.
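The Bytes/Interrupt figure from that trace can be cross-checked from
the other two numbers (assuming "256k" means 256 * 1024 bytes):

```python
# Cross-check of the trace summary above: 256 KiB written, 35
# interrupt handler invocations observed.
total_bytes = 256 * 1024
interrupts = 35

bytes_per_interrupt = total_bytes / interrupts
print(round(bytes_per_interrupt))  # ~7490, matching the trace summary
```

That is close to the 8192-byte worst case, i.e. nearly every handler
invocation runs the loop to the PASS_LIMIT-sized maximum.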
But this also makes it clear why it's a patently bad idea to work
around this in the detector...

Now being "smart" and disabling THRI in the IER before the write loop
in serial8250_tx_chars() changes the picture to:

 CPU 18/KVM-2045 [092] ..... 1466230.357478: kvm_pio: pio_write at 0x3f9 size 1 count 1 val 0x5

IER.THRI is cleared now, so it's expected that nothing fiddles with the
interrupt line anymore, right? But ....

 CPU 18/KVM-2045 [092] ..... 1466230.357479: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357481: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357484: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0x73
 CPU 18/KVM-2045 [092] ..... 1466230.357485: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357487: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357489: kvm_set_irq: gsi 4 level 0 source 0
 CPU 18/KVM-2045 [092] ..... 1466230.357491: kvm_set_irq: gsi 4 level 0 source 0
 ....

So every write to the TX FIFO sets the level to 0 four times in a row
to not create an edge. That's truly impressive!

Thanks,

        tglx