Re: [PATCH] xhci: Cleanup only when releasing primary hcd.

Mathias Nyman <mathias.nyman@xxxxxxxxx> · Thu, 21 Apr 2016 14:24:58 +0300

On 19.04.2016 21:13, Gabriel Krisman Bertazi wrote:
Under stress occasions some TI devices might not return early when
reading the status register during the quirk invocation of xhci_irq made
by usb_hcd_pci_remove.  This means that instead of returning, we end up
handling this interruption in the middle of a shutdown.  Since
xhci->event_ring has already been freed in xhci_mem_cleanup, we end up
accessing freed memory, causing the Oops below.

commit 8c24d6d7b09d ("usb: xhci: stop everything on the first call to
xhci_stop") is the one that changed the instant in which we clean up the
event queue when stopping a device.  Before, we didn't call
xhci_mem_cleanup at the first time xhci_stop is executed (for the shared
HCD), instead, we only did it after the invocation for the primary HCD,
much later at the removal path.  The code flow for this oops looks like
this:

xhci_pci_remove()
	usb_remove_hcd(xhci->shared)
	        xhci_stop(xhci->shared)
  			xhci_halt()
			xhci_mem_cleanup(xhci);  // Free the event_queue
	usb_hcd_pci_remove(primary)
		xhci_irq()  // Access the event_queue if STS_EINT is set. Crash.
		xhci_stop()
			xhci_halt()
			// return early

The fix modifies xhci_stop to only cleanup the xhci data when releasing
the primary HCD.  This way, we still have the event_queue configured
when invoking xhci_irq.  We still halt the device on the first call to
xhci_stop, though.

I could reproduce this issue several times on the mainline kernel by
doing a bind-unbind stress test with a specific storage gadget attached.
I also ran the same test over-night with my patch applied and didn't
observe the issue anymore.


Nice catch, and moving xhci_mem_cleanup() until second hcd (primary) is
removed is one way to solve it.

But I don't think we should even try to handle the interrupt at this stage anymore.
The host is already halted and normally the handler should not be called anymore.

How about handling interrupts for a halted host in the same way as a dying host?
Does something like this work for your TI devices?

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 99b4ff4..2cdc027 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2728,8 +2728,9 @@ hw_died:
                writel(irq_pending, &xhci->ir_set->irq_pending);
        }
 
-       if (xhci->xhc_state & XHCI_STATE_DYING) {
-               xhci_dbg(xhci, "xHCI dying, ignoring interrupt. "
+       if ((xhci->xhc_state & XHCI_STATE_DYING) ||
+           (xhci->xhc_state & XHCI_STATE_HALTED)) {
+               xhci_dbg(xhci, "xHCI halted or dying, ignoring interrupt. "
                                "Shouldn't IRQs be disabled?\n");
                /* Clear the event handler busy flag (RW1C);
                 * the event ring should be empty.


Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html