Mathias,

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> writes:
> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
> the "Plug non-maskable MSI affinity race" patch.
>
> I can see this on a couple platforms, I'm running a script that first generates
> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
> and on cpus:
>
> for i in 1 3 5 7; do
>     echo "1" > /sys/devices/system/cpu/cpu$i/online
> done
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "F" > "/proc/irq/*/smp_affinity"
> for i in 1 3 5 7; do
>     echo "0" > /sys/devices/system/cpu/cpu$i/online
> done

> trace snippet:
>       <idle>-0     [001] d.h.   129.676900: xhci_irq: xhci irq
>       <idle>-0     [001] d.h.   129.677507: xhci_irq: xhci irq
>       <idle>-0     [001] d.h.   129.677556: xhci_irq: xhci irq
>       <idle>-0     [001] d.h.   129.677647: xhci_irq: xhci irq
>        <...>-14    [001] d..1   129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6

Looks like a regular affinity setting in interrupt context, but I can't
make sense of the time stamps

>       <idle>-0     [003] d.h.   129.682639: xhci_irq: xhci irq
>       <idle>-0     [003] d.h.   129.702380: xhci_irq: xhci irq
>       <idle>-0     [003] d.h.   129.702493: xhci_irq: xhci irq
>  migration/3-24    [003] d..1   129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0

So this is a CPU offline operation and after that irq 122 is silent, right?

>  kworker/0:0-5     [000] d.h.   131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>       <idle>-0     [000] d.h.   137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0

That kworker context looks fishy. Can you please enable stacktraces in
the tracer so I can see the call chains leading to this?

OTOH that's irq 121 not 122. Anyway moar information is always useful.

And please add the patch below.

Thanks,

        tglx

8<---------------
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -243,6 +243,7 @@ u64 arch_irq_stat(void)
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
 	desc = __this_cpu_read(vector_irq[vector]);
+	trace_printk("vector: %u desc %lx\n", vector, (unsigned long) desc);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		if (IS_ENABLED(CONFIG_X86_32))
 			handle_irq(desc, regs);
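
For the stacktrace request above, a minimal sketch of one way to switch it on through tracefs. It assumes tracefs is mounted at /sys/kernel/tracing (older setups expose it under /sys/kernel/debug/tracing). The TRACEFS variable and the enable_stacktrace helper are names made up for this illustration, not part of any existing script.

```shell
#!/bin/sh
# Sketch only: turn on stack traces for recorded trace entries.
# Assumption: tracefs is mounted at /sys/kernel/tracing; override TRACEFS
# if it lives elsewhere (e.g. /sys/kernel/debug/tracing).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Hypothetical helper name, used only for this example.
enable_stacktrace() {
    # The "stacktrace" option makes ftrace record a stack trace after
    # each recorded entry.
    echo 1 > "$TRACEFS/options/stacktrace" || return 1
    # Truncating the trace file clears the ring buffer, so the next
    # reproducer run starts from an empty trace.
    : > "$TRACEFS/trace"
}

# Usage (as root): call enable_stacktrace, rerun the reproducer, then
# read the result with: cat "$TRACEFS/trace"
```

Stack traces make the buffer fill up considerably faster, so it may be worth capturing the trace soon after the interrupt goes silent.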