Re: [PATCH v2] xhci: Cleanup only when releasing primary hcd.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Mathias, Roger, Gabriel

I've been chasing strange MSI / legacy IRQ behavior from xHCI for a
couple days and wanted to report a few things that may be effected by
Gabriel's recent "xhci: Cleanup only when releasing primary hcd" patch
(more on this at the bottom).

After 8c24d6d7b09d "usb: xhci: stop everything on the first call to
xhci_stop" (v4.3+), there exists a device removal window that can later
leak interrupts and other device resources ordinarily cleaned up by
xhci_stop.

The quick summary is that if the xhci_handshake call from xhci_halt
fails, it will set xhci->xhc_state |= XHCI_STATE_HALTED.  Later when the
device is removed, xhci_stop keys off the xhc_state and will exit early
if XHCI_STATE_HALTED is set.  

This manifested itself on my Stratus machine while running repeated
device hotplug removals at various timing intervals during
initialization (for example 3, 6 or 12 seconds after hotplug add).  I
didn't grab a backtrace of xhci_halt, so I'm not exactly sure why it was
called after xhci_run, but note that it is not only called from
xhci_stop, but also xhci_handle_command_timeout and others.

The result was that a later device initialization couldn't allocate its
MSI interrupts -- after much instrumentation I finally traced it back to
the previous initialization / removal failing to free the MSI irqs it
requested. 

Here's the relevent snippet from added trace logging.
(ftmod_ioboard_event and ioboard_bringdown are functions from an
out-of-tree PCI hotplug driver for Stratus hardware platforms.  These
routines will call into PCI API to add or remove a set of PCI devices so
I left them here to help mark device lifetimes.)

Last successful MSI allocation
==============================

            java-17858 [002] ....  3005.763238: ftmod_ioboard_event: adding bus 03
            java-17858 [009] ....  3009.780359: usb_add_hcd: request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8808518e53d8)) return=0
            java-17858 [004] ....  3009.808979: usb_add_hcd: request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff880482d992a8)) return=0
            java-17858 [008] ....  3009.828263: xhci_halt: xhci(ffff88083f5a8258)
            java-17858 [008] ....  3009.828269: xhci_quiesce: xhci(ffff88083f5a8258) halted=1
            java-17858 [008] ....  3009.828273: xhci_quiesce: xhci(ffff88083f5a8258) cmd=0
            java-17858 [008] ....  3009.829614: xhci_setup_msix: xhci=ffff88083f5a8258
            java-17858 [008] ....  3009.829627: xhci_setup_msix: xhci=ffff88083f5a8258 Failed to enable MSI-X, ret=-22
            java-17858 [008] ....  3009.829635: xhci_run: xhci=ffff88083f5a8258
            java-17858 [008] ....  3009.830069: xhci_run: request_irq(119, ...) MSI desc=ffff88083ac32e68 ret=0
            java-17858 [007] ....  3010.009977: xhci_halt: xhci(ffff88083f5a8258)

xhci_halt is called, perhaps from
xhci_handle_command_timeout?

            java-17858 [007] ....  3010.009985: xhci_quiesce: xhci(ffff88083f5a8258) halted=1
            java-17858 [007] ....  3010.009986: xhci_quiesce: xhci(ffff88083f5a8258) cmd=ffffffff

cmd = ~0, so the hardware is gone!  subsequent
call to xhci_handshake will detect this and return
-ENODEV, xhci_halt will then set 
xhci->xhc_state |= XHCI_STATE_HALTED

            java-17858 [007] d...  3010.010381: xhci_irq: hcd(ffff88083f5a8008)->irq=0 (depth=0, count=300, unhandled=0), status==~0

xhci_irq notices the missing hardware, too.

            java-17858 [007] ....  3010.014476: xhci_stop: xhci(ffff88083f5a8258)->xhc_state=2

xhci_stop is called and exits early because
(xhci->xhc_state & XHCI_STATE_HALTED)
Notice no entries for xhci_free_irq!

            java-17858 [007] ....  3010.029101: ftmod_ioboard_event: adding bus 79
   kworker/u97:3-974   [009] ....  3013.264132: ioboard_bringdown: board 0
   kworker/u97:3-974   [009] ....  3013.264145: ioboard_bringdown: sbus=0, removing bus 03
   kworker/u97:3-974   [005] ....  3016.551113: ioboard_bringdown: sbus=1, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551119: ioboard_bringdown: sbus=2, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551119: ioboard_bringdown: sbus=3, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551119: ioboard_bringdown: sbus=4, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551120: ioboard_bringdown: sbus=5, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551120: ioboard_bringdown: sbus=6, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551120: ioboard_bringdown: sbus=7, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551120: ioboard_bringdown: sbus=8, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551120: ioboard_bringdown: sbus=9, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551124: ioboard_bringdown: sbus=10, pdev =           (null)
   kworker/u97:3-974   [005] ....  3016.551125: ioboard_bringdown: sbus=11, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551126: ioboard_bringdown: sbus=12, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551126: ioboard_bringdown: sbus=13, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551126: ioboard_bringdown: sbus=14, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551127: ioboard_bringdown: sbus=15, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551127: ioboard_bringdown: sbus=16, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551127: ioboard_bringdown: sbus=17, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551127: ioboard_bringdown: sbus=18, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551127: ioboard_bringdown: sbus=19, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551128: ioboard_bringdown: sbus=20, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551128: ioboard_bringdown: sbus=21, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551128: ioboard_bringdown: sbus=22, bs->TopLevelBridge[sbus] = 0
   kworker/u97:3-974   [005] ....  3016.551128: ioboard_bringdown: sbus=23, bs->TopLevelBridge[sbus] = 0

First failed MSI allocation
===========================

            java-17858 [011] ....  3084.238393: ftmod_ioboard_event: adding bus 03
            java-17858 [005] ....  3088.071479: usb_add_hcd: request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8804f5cc2548)) return=0
            java-17858 [005] ....  3088.096260: usb_add_hcd: request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff8804f5cc6678)) return=0
            java-17858 [000] .N..  3088.121585: xhci_halt: xhci(ffff880857730258)
            java-17858 [000] .N..  3088.121593: xhci_quiesce: xhci(ffff880857730258) halted=1
            java-17858 [000] .N..  3088.121597: xhci_quiesce: xhci(ffff880857730258) cmd=0
            java-17858 [000] ....  3088.123387: xhci_setup_msix: xhci=ffff880857730258
            java-17858 [000] ....  3088.123400: xhci_setup_msix: xhci=ffff880857730258 Failed to enable MSI-X, ret=-22
            java-17858 [000] ....  3088.123408: xhci_run: xhci=ffff880857730258
            java-17858 [000] ....  3088.123486: msi_domain_alloc_irqs: virq = -17

#define EEXIST          17      /* File exists */

            java-17858 [000] ....  3088.123495: pci_enable_msi_range: pci_msi_setup_msi_irqs ret=-28
            java-17858 [000] ....  3088.123497: xhci_run: xhci=ffff880857730258 failed to allocate MSI entry, ret=-28

#define ENOSPC          28      /* No space left on device */

            java-17858 [000] ....  3088.123548: xhci_run: request_irq(10, ...) legacy desc=ffff88085f41bba8 ret=0
          <idle>-0     [000] d.H.  3088.388279: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=1, unhandled=0), status=0 (!STS_EINT)
          <idle>-0     [000] d.H.  3088.396057: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=2, unhandled=0), status=0 (!STS_EINT)
          <idle>-0     [000] d.H.  3088.396424: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=3, unhandled=0), status=0 (!STS_EINT)
     ksoftirqd/0-3     [000] d.H.  3088.491807: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=4, unhandled=0), status=0 (!STS_EINT)
    kworker/0:1H-984   [000] d.H.  3088.495645: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=5, unhandled=0), status=0 (!STS_EINT)
    kworker/0:1H-984   [000] d.H.  3088.496136: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=6, unhandled=0), status=0 (!STS_EINT)
          <idle>-0     [000] dNH.  3088.508039: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=8, unhandled=0), status=0 (!STS_EINT)
    kworker/0:1H-984   [000] d.H.  3088.511258: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=9, unhandled=0), status=0 (!STS_EINT)
    kworker/0:1H-984   [000] d.H.  3088.515058: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=10, unhandled=0), status=0 (!STS_EINT)
     ksoftirqd/0-3     [000] d.h.  3088.528961: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=11, unhandled=0), status=0 (!STS_EINT)
     ksoftirqd/0-3     [000] d.H.  3088.529620: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=12, unhandled=0), status=0 (!STS_EINT)
     ksoftirqd/0-3     [000] d.H.  3088.532033: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=13, unhandled=0), status=0 (!STS_EINT)
     kworker/0:2-28078 [000] d.H.  3088.538757: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=14, unhandled=0), status=0 (!STS_EINT)
     kworker/0:2-28078 [000] d.H.  3088.539227: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=15, unhandled=0), status=0 (!STS_EINT)
     kworker/0:2-28078 [000] dNH.  3088.542054: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=16, unhandled=0), status=0 (!STS_EINT)
    kworker/0:1H-984   [000] d.h.  3088.548702: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=17, unhandled=0), status=0 (!STS_EINT)
    kworker/0:1H-984   [000] d.h.  3088.549322: xhci_irq: hcd(ffff880857730008)->irq=10 (depth=0, count=18, unhandled=0), status=0 (!STS_EINT)

This continues until the device is removed later.

I'm currently testing with reverting 8c24d6d7b09d "usb: xhci: stop
everything on the first call to xhci_stop" to see how that fares.

In the meantime I was browsing recent linux-scsi archives and noticed
Gabriel's [PATCH v2] xhci: Cleanup only when releasing primary hcd:

  http://marc.info/?l=linux-usb&m=146134802313366&w=2

which looks like it modifies xhci_stop() to continue on to
xhci_cleanup_msix even if XHCI_STATE_HALTED (for primary host
controller's anyway).

I think this may avoid the issue that I've been seeing and was wondering
if it would be safe to test this patch on top of 4.6.0-rc3?  (An older
RC because my repro is very timing dependent, so moving the kernel
forward / backwards or adding excessive trace logging can make it hard
to debug.)

Regards,

-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux