Hello Mathias, Roger, Gabriel I've been chasing strange MSI / legacy IRQ behavior from xHCI for a couple days and wanted to report a few things that may be effected by Gabriel's recent "xhci: Cleanup only when releasing primary hcd" patch (more on this at the bottom). After 8c24d6d7b09d "usb: xhci: stop everything on the first call to xhci_stop" (v4.3+), there exists a device removal window that can later leak interrupts and other device resources ordinarily cleaned up by xhci_stop. The quick summary is that if the xhci_handshake call from xhci_halt fails, it will set xhci->xhc_state |= XHCI_STATE_HALTED. Later when the device is removed, xhci_stop keys off the xhc_state and will exit early if XHCI_STATE_HALTED is set. This manifested itself on my Stratus machine while running repeated device hotplug removals at various timing intervals during initialization (for example 3, 6 or 12 seconds after hotplug add). I didn't grab a backtrace of xhci_halt, so I'm not exactly sure why it was called after xhci_run, but note that it is not only called from xhci_stop, but also xhci_handle_command_timeout and others. The result was that a later device initialization couldn't allocate its MSI interrupts -- after much instrumentation I finally traced it back to the previous initialization / removal failing to free the MSI irqs it requested. Here's the relevent snippet from added trace logging. (ftmod_ioboard_event and ioboard_bringdown are functions from an out-of-tree PCI hotplug driver for Stratus hardware platforms. These routines will call into PCI API to add or remove a set of PCI devices so I left them here to help mark device lifetimes.) Last successful MSI allocation ============================== java-17858 [002] .... 3005.763238: ftmod_ioboard_event: adding bus 03 java-17858 [009] .... 3009.780359: usb_add_hcd: request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8808518e53d8)) return=0 java-17858 [004] .... 3009.808979: usb_add_hcd: request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff880482d992a8)) return=0 java-17858 [008] .... 3009.828263: xhci_halt: xhci(ffff88083f5a8258) java-17858 [008] .... 3009.828269: xhci_quiesce: xhci(ffff88083f5a8258) halted=1 java-17858 [008] .... 3009.828273: xhci_quiesce: xhci(ffff88083f5a8258) cmd=0 java-17858 [008] .... 3009.829614: xhci_setup_msix: xhci=ffff88083f5a8258 java-17858 [008] .... 3009.829627: xhci_setup_msix: xhci=ffff88083f5a8258 Failed to enable MSI-X, ret=-22 java-17858 [008] .... 3009.829635: xhci_run: xhci=ffff88083f5a8258 java-17858 [008] .... 3009.830069: xhci_run: request_irq(119, ...) MSI desc=ffff88083ac32e68 ret=0 java-17858 [007] .... 3010.009977: xhci_halt: xhci(ffff88083f5a8258) xhci_halt is called, perhaps from xhci_handle_command_timeout? java-17858 [007] .... 3010.009985: xhci_quiesce: xhci(ffff88083f5a8258) halted=1 java-17858 [007] .... 3010.009986: xhci_quiesce: xhci(ffff88083f5a8258) cmd=ffffffff cmd = ~0, so the hardware is gone! subsequent call to xhci_handshake will detect this and return -ENODEV, xhci_halt will then set xhci->xhc_state |= XHCI_STATE_HALTED java-17858 [007] d... 3010.010381: xhci_irq: hcd(ffff88083f5a8008)->irq=0 (depth=0, count=300, unhandled=0), status==~0 xhci_irq notices the missing hardware, too. java-17858 [007] .... 3010.014476: xhci_stop: xhci(ffff88083f5a8258)->xhc_state=2 xhci_stop is called and exits early because (xhci->xhc_state & XHCI_STATE_HALTED) Notice no entries for xhci_free_irq! java-17858 [007] .... 3010.029101: ftmod_ioboard_event: adding bus 79 kworker/u97:3-974 [009] .... 3013.264132: ioboard_bringdown: board 0 kworker/u97:3-974 [009] .... 3013.264145: ioboard_bringdown: sbus=0, removing bus 03 kworker/u97:3-974 [005] .... 3016.551113: ioboard_bringdown: sbus=1, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=2, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=3, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=4, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=5, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=6, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=7, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=8, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=9, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551124: ioboard_bringdown: sbus=10, pdev = (null) kworker/u97:3-974 [005] .... 3016.551125: ioboard_bringdown: sbus=11, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=12, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=13, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=14, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=15, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=16, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=17, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=18, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=19, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=20, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=21, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=22, bs->TopLevelBridge[sbus] = 0 kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=23, bs->TopLevelBridge[sbus] = 0 First failed MSI allocation =========================== java-17858 [011] .... 3084.238393: ftmod_ioboard_event: adding bus 03 java-17858 [005] .... 3088.071479: usb_add_hcd: request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8804f5cc2548)) return=0 java-17858 [005] .... 3088.096260: usb_add_hcd: request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff8804f5cc6678)) return=0 java-17858 [000] .N.. 3088.121585: xhci_halt: xhci(ffff880857730258) java-17858 [000] .N.. 3088.121593: xhci_quiesce: xhci(ffff880857730258) halted=1 java-17858 [000] .N.. 3088.121597: xhci_quiesce: xhci(ffff880857730258) cmd=0 java-17858 [000] .... 3088.123387: xhci_setup_msix: xhci=ffff880857730258 java-17858 [000] .... 3088.123400: xhci_setup_msix: xhci=ffff880857730258 Failed to enable MSI-X, ret=-22 java-17858 [000] .... 3088.123408: xhci_run: xhci=ffff880857730258 java-17858 [000] .... 3088.123486: msi_domain_alloc_irqs: virq = -17 #define EEXIST 17 /* File exists */ java-17858 [000] .... 3088.123495: pci_enable_msi_range: pci_msi_setup_msi_irqs ret=-28 java-17858 [000] .... 3088.123497: xhci_run: xhci=ffff880857730258 failed to allocate MSI entry, ret=-28 #define ENOSPC 28 /* No space left on device */ java-17858 [000] .... 3088.123548: xhci_run: request_irq(10, ...) legacy desc=ffff88085f41bba8 ret=0 <idle>-0 [000] d.H. 3088.388279: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=1, unhandled=0), status=0 (!STS_EINT) <idle>-0 [000] d.H. 3088.396057: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=2, unhandled=0), status=0 (!STS_EINT) <idle>-0 [000] d.H. 3088.396424: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=3, unhandled=0), status=0 (!STS_EINT) ksoftirqd/0-3 [000] d.H. 3088.491807: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=4, unhandled=0), status=0 (!STS_EINT) kworker/0:1H-984 [000] d.H. 3088.495645: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=5, unhandled=0), status=0 (!STS_EINT) kworker/0:1H-984 [000] d.H. 3088.496136: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=6, unhandled=0), status=0 (!STS_EINT) <idle>-0 [000] dNH. 3088.508039: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=8, unhandled=0), status=0 (!STS_EINT) kworker/0:1H-984 [000] d.H. 3088.511258: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=9, unhandled=0), status=0 (!STS_EINT) kworker/0:1H-984 [000] d.H. 3088.515058: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=10, unhandled=0), status=0 (!STS_EINT) ksoftirqd/0-3 [000] d.h. 3088.528961: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=11, unhandled=0), status=0 (!STS_EINT) ksoftirqd/0-3 [000] d.H. 3088.529620: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=12, unhandled=0), status=0 (!STS_EINT) ksoftirqd/0-3 [000] d.H. 3088.532033: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=13, unhandled=0), status=0 (!STS_EINT) kworker/0:2-28078 [000] d.H. 3088.538757: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=14, unhandled=0), status=0 (!STS_EINT) kworker/0:2-28078 [000] d.H. 3088.539227: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=15, unhandled=0), status=0 (!STS_EINT) kworker/0:2-28078 [000] dNH. 3088.542054: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=16, unhandled=0), status=0 (!STS_EINT) kworker/0:1H-984 [000] d.h. 3088.548702: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=17, unhandled=0), status=0 (!STS_EINT) kworker/0:1H-984 [000] d.h. 3088.549322: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=18, unhandled=0), status=0 (!STS_EINT) This continues until the device is removed later. I'm currently testing with reverting 8c24d6d7b09d "usb: xhci: stop everything on the first call to xhci_stop" to see how that fares. In the meantime I was browsing recent linux-scsi archives and noticed Gabriel's [PATCH v2] xhci: Cleanup only when releasing primary hcd: http://marc.info/?l=linux-usb&m=146134802313366&w=2 which looks like it modifies xhci_stop() to continue on to xhci_cleanup_msix even if XHCI_STATE_HALTED (for primary host controller's anyway). I think this may avoid the issue that I've been seeing and was wondering if it would be safe to test this patch on top of 4.6.0-rc3? (An older RC because my repro is very timing dependent, so moving the kernel forward / backwards or adding excessive trace logging can make it hard to debug.) Regards, -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html