Re: xHCI host dies on device unplug

On 21.12.2022 9.14, Ladislav Michl wrote:
+Cc: Sneeker Yeh

On Mon, Dec 19, 2022 at 10:45:43PM +0100, Ladislav Michl wrote:
On Mon, Dec 19, 2022 at 07:31:02PM +0100, Ladislav Michl wrote:
On Mon, Dec 19, 2022 at 02:25:46PM +0200, Mathias Nyman wrote:
On 16.12.2022 23.32, Ladislav Michl wrote:
On Fri, Dec 16, 2022 at 12:13:23PM +0200, Mathias Nyman wrote:
On 15.12.2022 18.12, Ladislav Michl wrote:
+Cc Mathias as he last touched this code path and may know more :)

On Tue, Dec 06, 2022 at 02:17:08PM +0100, Ladislav Michl wrote:
On Mon, Dec 05, 2022 at 10:27:57PM +0100, Ladislav Michl wrote:
I'm running current linux.git on a custom Marvell OCTEON III CN7020
based board. USB devices like an FTDI adapter (idVendor=0403, idProduct=6001,
bcdDevice= 6.00) and a Realtek WiFi dongle (idVendor=0bda, idProduct=8179,
bcdDevice= 0.00) work without issues, while a Ralink WiFi dongle
(idVendor=148f, idProduct=5370, bcdDevice= 1.01) kills the host on
disconnect:
xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command
xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
xhci-hcd xhci-hcd.0.auto: HC died; cleaning up

Unfortunately I do not have a datasheet for the CN7020 SoC, so it is hard
to tell if there are any errata :/ In case anyone sees a clue in the debug
logs below, I'll happily give it a try.

So I do have the datasheet now. As a wild guess I tried to use dlmc_ref_clk0
instead of dlmc_ref_clk1 as refclk-type-ss and it fixed the unplug death.
I have no clue why, but anyway - sorry for the noise :) Perhaps Octeon's
clock init is worth verifying...

After all, whether xhci dies with "xHCI host not responding to stop endpoint
command" also depends on temperature, so there seems to be a race somewhere.

As a quick and dirty check of whether xhci really died, the following patch
was tested and it fixed the issue. It just treats the ep as if the stop endpoint
command had succeeded. Any clues? I'll happily provide more traces.

It's possible the controller did complete the stop endpoint command but the driver
didn't get the interrupt for the event for some reason.


Looks like the controller didn't complete the stop endpoint command.

Event for last completed command (before cycle bit change "c" -> "C") was:
   0x00000000028f55a0: TRB 00000000035e81a0 status 'Success' len 0 slot 1 ep 0 type 'Command Completion Event' flags e:c,

This was for command at 35e81a0, which in the command ring was:
   0x00000000035e81a0: Reset Endpoint Command: ctx 0000000000000000 slot 1 ep 3 flags T:c

The stop endpoint command was the next command queued, at 35e81b0:
   0x00000000035e81b0: Stop Ring Command: slot 1 sp 0 ep 3 flags c

There were a lot of URBs queued for this device, and they are cancelled one by one after disconnect.

Was this the only device connected? If so does connecting another usb device to another root port help?
Just to test if the host for some reason partially stops a while after last device disconnect?

The device is connected directly to the SoC. When it is connected through a hub, the host
doesn't die (as noted in another email; sorry for not replying to my own message, so it got lost).
It looks like an intentional (power management?) optimization. If another device is
plugged in before the 5 sec timeout expires, the host completes the stop endpoint command.

Unfortunately I cannot find anything describing this behavior in the
documentation, so I'll ask the manufacturer's support.

As support is usually slow, I asked a search engine first and this sounds
familiar:
"Synopsis Designware USB3 IP earlier than v3.00a which is configured in silicon
with DWC_USB3_SUSPEND_ON_DISCONNECT_EN=1, would need a specific quirk to prevent
xhci host controller from dying when device is disconnected."

usb: dwc3: Add quirk for Synopsis device disconnection errata
https://patchwork.kernel.org/project/linux-omap/patch/1424151697-2084-5-git-send-email-Sneeker.Yeh@xxxxxxxxxxxxxx/

Any clue what happened with that? I haven't found any meaningful traces...

Let's step back a bit. All tests so far were done with the mainline 6.1.0 kernel.
I also tested Marvell's vendor trees, one based on 4.9.79, the second on 5.4.30,
both heavily patched. The last version of the above patch I found is v5:
https://lkml.org/lkml/2015/2/21/260


I looked at that same series and turned patch 1/5 into a standalone quick hack that applies on 6.1.

Untested, does it work for you?

Can be found in a delay_csc_clear branch in my tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git delay_csc_clear
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=delay_csc_clear

It looks like this (copy-pasted, so tabs might be messed up):

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 4619d5e89d5b..5bc1f78b41da 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -603,6 +603,10 @@ static void xhci_clear_port_change_bit(struct xhci_hcd *xhci, u16 wValue,
                port_change_bit = "warm(BH) reset";
                break;
        case USB_PORT_FEAT_C_CONNECTION:
+               if (1 && !(readl(addr) & PORT_CONNECT)) { /* add proper quirk */
+                       xhci_warn(xhci, "Delay clearing port-%d CSC\n", wIndex + 1);
+                       return;
+               }
                status = PORT_CSC;
                port_change_bit = "connect";
                break;
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 79d7931c048a..133ec4b8930f 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4000,6 +4000,28 @@ static void xhci_free_dev(struct usb_hcd *hcd, struct usb_device *udev)
                virt_dev->eps[i].ep_state &= ~EP_STOP_CMD_PENDING;
        virt_dev->udev = NULL;
        xhci_disable_slot(xhci, udev->slot_id);
+
+       if (1 && udev->parent && !udev->parent->parent) { /*fixme, real quirk */
+               struct xhci_hub *rhub;
+               u32 portsc;
+
+               rhub = xhci_get_rhub(hcd);
+
+               if (udev->portnum > rhub->num_ports) {
+                       xhci_warn(xhci, "Invalid portnum %d for late clearing CSC\n", udev->portnum);
+                       goto out;
+               }
+
+               portsc = readl(rhub->ports[udev->portnum - 1]->addr);
+
+               if (!(portsc & PORT_CONNECT) && (portsc & PORT_CSC)) {
+                       xhci_warn(xhci, "Late clearing port-%d CSC, portsc 0x%x\n",
+                                 udev->portnum, portsc);
+                       portsc = xhci_port_state_to_neutral(portsc);
+                       writel(portsc | PORT_CSC, rhub->ports[udev->portnum - 1]->addr);
+               }
+       }
+out:
        xhci_free_virt_device(xhci, udev->slot_id);
 }
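
For readers who want to follow the ordering the hack above relies on, here is
a minimal userspace sketch of the same two-step logic. It is not driver code:
struct sim_port, sim_write_portsc(), hub_clear_csc() and late_clear_csc() are
hypothetical names invented for this illustration, and the register model only
tracks the two bits involved. The bit positions (PORT_CONNECT as bit 0,
PORT_CSC as bit 17, write-1-to-clear) follow the xHCI PORTSC layout. The real
patch uses readl()/writel() and xhci_port_state_to_neutral() instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions follow the xHCI PORTSC register layout. */
#define PORT_CONNECT (1u << 0)   /* Current Connect Status (read-only) */
#define PORT_CSC     (1u << 17)  /* Connect Status Change (write 1 to clear) */

/* Hypothetical stand-in for one root port's PORTSC register. */
struct sim_port {
	uint32_t portsc;
};

/* Model the write-1-to-clear behavior of CSC; other bits are ignored here. */
static void sim_write_portsc(struct sim_port *port, uint32_t val)
{
	assert(port != NULL);
	if (val & PORT_CSC)
		port->portsc &= ~PORT_CSC;
}

/*
 * Hub-driven clear of C_PORT_CONNECTION (mirrors the xhci-hub.c hunk).
 * Returns true if CSC was cleared, false if the clear was postponed
 * because nothing is connected to the port.
 */
static bool hub_clear_csc(struct sim_port *port)
{
	if (!(port->portsc & PORT_CONNECT))
		return false;            /* "Delay clearing port CSC" */
	sim_write_portsc(port, PORT_CSC);
	return true;
}

/*
 * Late clear once the device slot has been disabled (mirrors the xhci.c
 * hunk). Returns true only if the port is disconnected and CSC was still
 * pending, in which case CSC is cleared now.
 */
static bool late_clear_csc(struct sim_port *port)
{
	if ((port->portsc & PORT_CONNECT) || !(port->portsc & PORT_CSC))
		return false;
	sim_write_portsc(port, PORT_CSC);
	return true;
}
```

The point of the split is that on a disconnect the CSC bit stays set until the
stop endpoint and disable slot commands have been issued, which is what the
quoted Synopsys errata description suggests the controller needs. On a connect
the hub-driven path clears CSC immediately, as before.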


Thanks
-Mathias


