Hi Mathias, On Wed, Sep 26, 2018 at 02:51:40PM +0300, Mathias Nyman wrote: > Hi Jack, Peter > > On 24.09.2018 19:37, Jack Pham wrote: > >Hi Peter, > > > >On Fri, Sep 21, 2018 at 09:48:43AM +0800, Peter Chen wrote: > >>Type-C-to-A cable, and the USB3 HCD has already been NULL at that time. > >>The oops log like below: > >> > >>[681.782288] xhci-hcd xhci-hcd.1.auto: remove, state 1 > >>[681.787490] usb usb4: USB disconnect, device number 1 > >>[681.792808] usb 4-1: USB disconnect, device number 2 > >>[681.818089] xhci-hcd xhci-hcd.1.auto: USB bus 4 deregistered > >>[681.823803] Unable to handle kernel NULL pointer dereference at virtual address 000000a0 > >>[681.823806] Mem abort info: > >>[681.823809] Exception class = DABT (current EL), IL = 32 bits > >>[681.823811] SET = 0, FnV = 0 > >>[681.823813] EA = 0, S1PTW = 0 > >>[681.823814] Data abort info: > >>[681.823816] ISV = 0, ISS = 0x00000004 > >>[681.823818] CM = 0, WnR = 0 > >>[681.823822] user pgtable: 4k pages, 48-bit VAs, pgd = ffff8000ae3fd000 > >>[681.823824] [00000000000000a0] *pgd=0000000000000000 > >>[681.823829] Internal error: Oops: 96000004 [#1] PREEMPT SMP > >>[681.823832] Modules linked in: 8021q garp stp mrp crc32_ce qca6174(O) crct10dif_ce galcore(O) > >>[681.823849] CPU: 0 PID: 94 Comm: kworker/0:1 Tainted: G O 4.14.62-imx_4.14.y+gcd63def #1 > >>[681.823851] Hardware name: Freescale i.MX8MQ EVK (DT) > >>[681.823862] Workqueue: events_freezable __dwc3_set_mode > >>[681.823865] task: ffff8000b8a18000 task.stack: ffff00000a010000 > >>[681.823872] PC is at xhci_irq+0x5fc/0x14b8 > >>[681.823875] LR is at xhci_irq+0x3c/0x14b8 > > > ><snip> > > > >>diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c > >>index f0a99aa0ac58..2dc5176b79d0 100644 > >>--- a/drivers/usb/host/xhci-ring.c > >>+++ b/drivers/usb/host/xhci-ring.c > >>@@ -2680,7 +2680,8 @@ irqreturn_t xhci_irq(struct usb_hcd *hcd) > >> } > >> if (xhci->xhc_state & XHCI_STATE_DYING || > >>- xhci->xhc_state & XHCI_STATE_HALTED) { > >>+ xhci->xhc_state & XHCI_STATE_HALTED || > >>+ xhci->xhc_state & XHCI_STATE_REMOVING) { > >> xhci_dbg(xhci, "xHCI dying, ignoring interrupt. " > >> "Shouldn't IRQs be disabled?\n"); > >> /* Clear the event handler busy flag (RW1C); > > > >We also noticed the same crash as you found, and tried to fix it in a > >similar way, but noticed that this still allows a memory leak to happen. > > > >It seems from commit fe190ed0d602a ("xhci: Do not halt the host until > >both HCD have disconnected their devices.") this was added to > >xhci_stop(), and is the reason we encounter the NULL pointer in > >xhci_irq() when it tries to access xhci->shared_hcd. > > > > + /* Only halt host and free memory after both hcds are removed */ > > if (!usb_hcd_is_primary_hcd(hcd)) { > > + /* usb core will free this hcd shortly, unset pointer */ > > + xhci->shared_hcd = NULL; > > mutex_unlock(&xhci->mutex); > > return; > > } > > > >While your fix will simply abort the xhci_irq() function if it > >encounters XHCI_STATE_REMOVING, during xhci_plat_remove(): > > > > static int xhci_plat_remove(struct platform_device *dev) > > { > > struct usb_hcd *hcd = platform_get_drvdata(dev); > > struct xhci_hcd *xhci = hcd_to_xhci(hcd); > > struct clk *clk = xhci->clk; > > struct clk *reg_clk = xhci->reg_clk; > > > > xhci->xhc_state |= XHCI_STATE_REMOVING; > > > > usb_remove_hcd(xhci->shared_hcd); > > ^^^^^^^^^ calls xhci_stop() and sets shared_hcd=NULL > > usb_phy_shutdown(hcd->usb_phy); > > > > usb_remove_hcd(hcd); > > usb_put_hcd(xhci->shared_hcd); > > ^^^^^^^^^^^ shared_hcd==NULL, so this is a no-op > > > >Since usb_put_hcd() doesn't get called for shared_hcd, we end up > >with one additional kref count and hence a leak. > > Nice catch, this same issue exists in > > xhci_pci_remove() > tegra_xusb_remove() > xhci_mtk_remove() > xhci_histb_remove() > > > > > >Wondering if we need to also remove the xhci->shared_hcd = NULL from > >xhci_stop(), in addition to your patches. Thoughts? > > At some point the xhci->shared_hcd needs to be set NULL, it can be done in > in the xhci_plat_remove(), xhci_pci_remove() and the similar remove functions > after calling usb_remove_hcd(). we can't rely on xhci->shared_hcd after it > has been removed. Currently it is already done as follows: usb_remove_hcd(xhci->shared_hcd) -> xhci_stop() /* Only halt host and free memory after both hcds are removed */ if (!usb_hcd_is_primary_hcd(hcd)) { /* usb core will free this hcd shortly, unset pointer */ xhci->shared_hcd = NULL; I guess at the very least your suggestion to do it from the xhci_{plat,pci,mtk,tegra}_remove() context makes it more explicit rather than having it as a side effect of usb_remove_hcd()/xhci_stop(). The net effect is more or less the same though IMO. > The xHC controller is stopped and interrupts disabled only after the primary > hcd (USB2) is removed, this is because usb core will try to cleanly flush last USB2 > URBs and take down the last endpoints when the USB2 usb_remove_hcd(hcd) is called. > We need a running xHC controller to do this. > > This means we can get interrupts we need to handle even if the shared_hcd is removed. > So I think we still need to handle interrupts even if XHCI_STATE_REMOVING flag is set. > > I think we also need to make the xhci interrupt handler capable of handling situations > where xhci->shared_hcd is set to NULL > > Does this sound reasonable? I see your point. So Peter's patchset might have unintended consequences by preventing further interrupts to be handled when there is some necessary cleanup to be done? I realize now that Peter and I are both encountering the NULL pointer on 4.14. I believe it's when xhci_irq() calls to handle_port_status(): /* Figure out which usb_hcd this port is attached to: * is it a USB 3.0 port or a USB 2.0/1.1 port? */ major_revision = xhci->port_array[port_id - 1]; /* Find the right roothub. */ hcd = xhci_to_hcd(xhci); if ((major_revision == 0x03) != (hcd->speed >= HCD_USB3)) hcd = xhci->shared_hcd; then later tries to dereference hcd which was taken from the NULL shared_hcd: bus_state = &xhci->bus_state[hcd_index(hcd)]; if (hcd->speed >= HCD_USB3) port_array = xhci->usb3_ports; else port_array = xhci->usb2_ports; Since 4.18 you had recently added commit 52c7755ba19e ("xhci: xhci-ring: use port structures for port event handler") which replaces the above by deriving hcd from the xhci->hw_ports array instead of assigning it from shared_hcd. So I guess this issue probably won't be encountered on mainline (sorry, I have yet to test this). Since the problem is there from 4.12 to 4.17 due to commit fe190ed0d602, do you have a suggestion on how to fix it for -stable? Thanks, Jack