On 20.3.2020 11.52, Thomas Gleixner wrote:
> Mathias,
>
> Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> writes:
>> I can reproduce the lost MSI interrupt issue on 5.6-rc6, which includes
>> the "Plug non-maskable MSI affinity race" patch.
>>
>> I can see this on a couple of platforms. I'm running a script that first
>> generates a lot of usb traffic, and then in a busyloop sets irq affinity
>> and turns cpus off and on:
>>
>> for i in 1 3 5 7; do
>>   echo "1" > /sys/devices/system/cpu/cpu$i/online
>> done
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "F" > "/proc/irq/*/smp_affinity"
>> for i in 1 3 5 7; do
>>   echo "0" > /sys/devices/system/cpu/cpu$i/online
>> done
>>
>> trace snippet:
>>
>> <idle>-0       [001] d.h.   129.676900: xhci_irq: xhci irq
>> <idle>-0       [001] d.h.   129.677507: xhci_irq: xhci irq
>> <idle>-0       [001] d.h.   129.677556: xhci_irq: xhci irq
>> <idle>-0       [001] d.h.   129.677647: xhci_irq: xhci irq
>> <...>-14       [001] d..1   129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6
>
> Looks like a regular affinity setting in interrupt context, but I can't
> make sense of the time stamps.

I think so; everything still worked normally after this one.

>> <idle>-0       [003] d.h.   129.682639: xhci_irq: xhci irq
>> <idle>-0       [003] d.h.   129.702380: xhci_irq: xhci irq
>> <idle>-0       [003] d.h.   129.702493: xhci_irq: xhci irq
>> migration/3-24 [003] d..1   129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0
>
> So this is a CPU offline operation and after that irq 122 is silent, right?

Yes, after this irq 122 was silent.

>> kworker/0:0-5  [000] d.h.   131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5  [000] d.h.   133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5  [000] d.h.   135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> <idle>-0       [000] d.h.   137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5  [000] d.h.   139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5  [000] d.h.   141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5  [000] d.h.   143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5  [000] d.h.   145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>
> That kworker context looks fishy. Can you please enable stacktraces in
> the tracer so I can see the call chains leading to this? OTOH that's irq
> 121, not 122. Anyway, moar information is always useful.
>
> And please add the patch below.

The full function trace with the patch applied is huge; it can be found compressed at
https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing

xhci_traces.tgz contains:

  trace_full:         full function trace
  trace:              timestamps ~48.29 to ~48.93 of the trace above, the section with the last xhci irq
  trace_printk_only:  only the trace_printk() output of "trace" above

This time xhci interrupts stopped after:

migration/3-24 [003] d..1    48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4

Thanks
-Mathias
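[Editorial note on the quoted reproducer: the line `echo "A" > "/proc/irq/*/smp_affinity"` quotes the glob, so the shell redirects to a literal file named `/proc/irq/*/smp_affinity` instead of touching each interrupt's mask file (and an unquoted glob in a redirection target would be an ambiguous redirect anyway). A minimal sketch of a loop that presumably matches the intent; the `IRQ_DIR` variable is an addition here, not part of the original script, so the loop can be pointed at a test directory.]

```shell
# Sketch, not the original reproducer: write an affinity mask to every
# IRQ's smp_affinity file. IRQ_DIR is a hypothetical knob added here;
# it defaults to the real /proc/irq.
IRQ_DIR="${IRQ_DIR:-/proc/irq}"

set_affinity_all() {
    mask="$1"    # e.g. "A" or "F", as in the quoted script
    for f in "$IRQ_DIR"/*/smp_affinity; do
        [ -e "$f" ] || continue
        # Writes can fail for managed or per-cpu interrupts; ignore errors.
        echo "$mask" > "$f" 2>/dev/null || true
    done
}
```

Used as `set_affinity_all A; set_affinity_all A; set_affinity_all F`, this mirrors the A/A/F write sequence in the quoted script.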
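[Editorial note on "enable stacktraces in the tracer": ftrace can record a call chain with every trace entry via its `stacktrace` option. A minimal sketch, assuming tracefs is mounted at one of its usual locations; the `TRACEFS` variable is an addition here so the helpers can be exercised against a test directory.]

```shell
# Sketch: toggle ftrace's per-entry stack traces. On a real system this
# needs root and tracefs mounted (commonly /sys/kernel/tracing or
# /sys/kernel/debug/tracing); TRACEFS is a hypothetical override knob.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

enable_trace_stacktrace() {
    # With this option set, each trace entry is followed by the call
    # chain that led to it -- the information requested above.
    echo 1 > "$TRACEFS/options/stacktrace"
}

disable_trace_stacktrace() {
    echo 0 > "$TRACEFS/options/stacktrace"
}
```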