Hi Joe, On 27/04/16 23:41, Joe Lawrence wrote: > Hello Mathias, Roger, Gabriel > > I've been chasing strange MSI / legacy IRQ behavior from xHCI for a > couple days and wanted to report a few things that may be effected by > Gabriel's recent "xhci: Cleanup only when releasing primary hcd" patch > (more on this at the bottom). > > After 8c24d6d7b09d "usb: xhci: stop everything on the first call to > xhci_stop" (v4.3+), there exists a device removal window that can later > leak interrupts and other device resources ordinarily cleaned up by > xhci_stop. > > The quick summary is that if the xhci_handshake call from xhci_halt > fails, it will set xhci->xhc_state |= XHCI_STATE_HALTED. Later when the > device is removed, xhci_stop keys off the xhc_state and will exit early > if XHCI_STATE_HALTED is set. > > This manifested itself on my Stratus machine while running repeated > device hotplug removals at various timing intervals during > initialization (for example 3, 6 or 12 seconds after hotplug add). I > didn't grab a backtrace of xhci_halt, so I'm not exactly sure why it was > called after xhci_run, but note that it is not only called from > xhci_stop, but also xhci_handle_command_timeout and others. > > The result was that a later device initialization couldn't allocate its > MSI interrupts -- after much instrumentation I finally traced it back to > the previous initialization / removal failing to free the MSI irqs it > requested. > > Here's the relevent snippet from added trace logging. > (ftmod_ioboard_event and ioboard_bringdown are functions from an > out-of-tree PCI hotplug driver for Stratus hardware platforms. These > routines will call into PCI API to add or remove a set of PCI devices so > I left them here to help mark device lifetimes.) > > Last successful MSI allocation > ============================== > > java-17858 [002] .... 3005.763238: ftmod_ioboard_event: adding bus 03 > java-17858 [009] .... 3009.780359: usb_add_hcd: request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8808518e53d8)) return=0 > java-17858 [004] .... 3009.808979: usb_add_hcd: request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff880482d992a8)) return=0 > java-17858 [008] .... 3009.828263: xhci_halt: xhci(ffff88083f5a8258) > java-17858 [008] .... 3009.828269: xhci_quiesce: xhci(ffff88083f5a8258) halted=1 > java-17858 [008] .... 3009.828273: xhci_quiesce: xhci(ffff88083f5a8258) cmd=0 > java-17858 [008] .... 3009.829614: xhci_setup_msix: xhci=ffff88083f5a8258 > java-17858 [008] .... 3009.829627: xhci_setup_msix: xhci=ffff88083f5a8258 Failed to enable MSI-X, ret=-22 > java-17858 [008] .... 3009.829635: xhci_run: xhci=ffff88083f5a8258 > java-17858 [008] .... 3009.830069: xhci_run: request_irq(119, ...) MSI desc=ffff88083ac32e68 ret=0 > java-17858 [007] .... 3010.009977: xhci_halt: xhci(ffff88083f5a8258) > > xhci_halt is called, perhaps from > xhci_handle_command_timeout? > > java-17858 [007] .... 3010.009985: xhci_quiesce: xhci(ffff88083f5a8258) halted=1 > java-17858 [007] .... 3010.009986: xhci_quiesce: xhci(ffff88083f5a8258) cmd=ffffffff > > cmd = ~0, so the hardware is gone! subsequent > call to xhci_handshake will detect this and return > -ENODEV, xhci_halt will then set > xhci->xhc_state |= XHCI_STATE_HALTED > > java-17858 [007] d... 3010.010381: xhci_irq: hcd(ffff88083f5a8008)->irq=0 (depth=0, count=300, unhandled=0), status==~0 > > xhci_irq notices the missing hardware, too. > > java-17858 [007] .... 3010.014476: xhci_stop: xhci(ffff88083f5a8258)->xhc_state=2 > > xhci_stop is called and exits early because > (xhci->xhc_state & XHCI_STATE_HALTED) > Notice no entries for xhci_free_irq! > > java-17858 [007] .... 3010.029101: ftmod_ioboard_event: adding bus 79 > kworker/u97:3-974 [009] .... 3013.264132: ioboard_bringdown: board 0 > kworker/u97:3-974 [009] .... 3013.264145: ioboard_bringdown: sbus=0, removing bus 03 > kworker/u97:3-974 [005] .... 3016.551113: ioboard_bringdown: sbus=1, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=2, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=3, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=4, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=5, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=6, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=7, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=8, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=9, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551124: ioboard_bringdown: sbus=10, pdev = (null) > kworker/u97:3-974 [005] .... 3016.551125: ioboard_bringdown: sbus=11, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=12, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=13, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=14, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=15, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=16, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=17, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=18, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=19, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=20, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=21, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=22, bs->TopLevelBridge[sbus] = 0 > kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=23, bs->TopLevelBridge[sbus] = 0 > > First failed MSI allocation > =========================== > > java-17858 [011] .... 3084.238393: ftmod_ioboard_event: adding bus 03 > java-17858 [005] .... 3088.071479: usb_add_hcd: request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8804f5cc2548)) return=0 > java-17858 [005] .... 3088.096260: usb_add_hcd: request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff8804f5cc6678)) return=0 > java-17858 [000] .N.. 3088.121585: xhci_halt: xhci(ffff880857730258) > java-17858 [000] .N.. 3088.121593: xhci_quiesce: xhci(ffff880857730258) halted=1 > java-17858 [000] .N.. 3088.121597: xhci_quiesce: xhci(ffff880857730258) cmd=0 > java-17858 [000] .... 3088.123387: xhci_setup_msix: xhci=ffff880857730258 > java-17858 [000] .... 3088.123400: xhci_setup_msix: xhci=ffff880857730258 Failed to enable MSI-X, ret=-22 > java-17858 [000] .... 3088.123408: xhci_run: xhci=ffff880857730258 > java-17858 [000] .... 3088.123486: msi_domain_alloc_irqs: virq = -17 > > #define EEXIST 17 /* File exists */ > > java-17858 [000] .... 3088.123495: pci_enable_msi_range: pci_msi_setup_msi_irqs ret=-28 > java-17858 [000] .... 3088.123497: xhci_run: xhci=ffff880857730258 failed to allocate MSI entry, ret=-28 > > #define ENOSPC 28 /* No space left on device */ > > java-17858 [000] .... 3088.123548: xhci_run: request_irq(10, ...) legacy desc=ffff88085f41bba8 ret=0 > <idle>-0 [000] d.H. 3088.388279: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=1, unhandled=0), status=0 (!STS_EINT) > <idle>-0 [000] d.H. 3088.396057: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=2, unhandled=0), status=0 (!STS_EINT) > <idle>-0 [000] d.H. 3088.396424: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=3, unhandled=0), status=0 (!STS_EINT) > ksoftirqd/0-3 [000] d.H. 3088.491807: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=4, unhandled=0), status=0 (!STS_EINT) > kworker/0:1H-984 [000] d.H. 3088.495645: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=5, unhandled=0), status=0 (!STS_EINT) > kworker/0:1H-984 [000] d.H. 3088.496136: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=6, unhandled=0), status=0 (!STS_EINT) > <idle>-0 [000] dNH. 3088.508039: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=8, unhandled=0), status=0 (!STS_EINT) > kworker/0:1H-984 [000] d.H. 3088.511258: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=9, unhandled=0), status=0 (!STS_EINT) > kworker/0:1H-984 [000] d.H. 3088.515058: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=10, unhandled=0), status=0 (!STS_EINT) > ksoftirqd/0-3 [000] d.h. 3088.528961: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=11, unhandled=0), status=0 (!STS_EINT) > ksoftirqd/0-3 [000] d.H. 3088.529620: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=12, unhandled=0), status=0 (!STS_EINT) > ksoftirqd/0-3 [000] d.H. 3088.532033: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=13, unhandled=0), status=0 (!STS_EINT) > kworker/0:2-28078 [000] d.H. 3088.538757: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=14, unhandled=0), status=0 (!STS_EINT) > kworker/0:2-28078 [000] d.H. 3088.539227: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=15, unhandled=0), status=0 (!STS_EINT) > kworker/0:2-28078 [000] dNH. 3088.542054: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=16, unhandled=0), status=0 (!STS_EINT) > kworker/0:1H-984 [000] d.h. 3088.548702: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=17, unhandled=0), status=0 (!STS_EINT) > kworker/0:1H-984 [000] d.h. 3088.549322: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=18, unhandled=0), status=0 (!STS_EINT) > > This continues until the device is removed later. > > I'm currently testing with reverting 8c24d6d7b09d "usb: xhci: stop > everything on the first call to xhci_stop" to see how that fares. > > In the meantime I was browsing recent linux-scsi archives and noticed > Gabriel's [PATCH v2] xhci: Cleanup only when releasing primary hcd: > > http://marc.info/?l=linux-usb&m=146134802313366&w=2 > > which looks like it modifies xhci_stop() to continue on to > xhci_cleanup_msix even if XHCI_STATE_HALTED (for primary host > controller's anyway). > > I think this may avoid the issue that I've been seeing and was wondering > if it would be safe to test this patch on top of 4.6.0-rc3? (An older > RC because my repro is very timing dependent, so moving the kernel > forward / backwards or adding excessive trace logging can make it hard > to debug.) Yes, please test it on top of v4.6-rc and I'm hoping it will be sent for stable as well. cheers, -roger -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html