> > We're experiencing a problem with our devices in the field where our
> > customers attach problematic USB devices that are causing the xhci
> > host controller to shut down with the "HC died; cleaning up" message.
>
> Is this seen only on some specific xHC host controller?

I don't think so. We've seen this on PCIe-attached ASMedia 3142 and NXP
ls1012a/ls1046a SoC controllers (which I think are dwc3 IP). Strangely, the
problem seems to be much easier to reproduce on the PCIe-attached asm3142.

> > I've narrowed this down to a timeout of the address device TRB on the
> > command ring (currently 5 seconds). It sometimes takes our hardware
> > 9.6 seconds to complete this TRB. When the driver is trying to stop
> > the cmd ring, the controller is busy for an additional 4.6 seconds.
> > This results in the "HC died" message and shutdown of the host
> > controller.
> >
> > If I bump the command ring timeout beyond the max TRB completion time,
> > the host controller continues to be responsive and doesn't need to be
> > shut down.
> >
> > My knowledge of how the usb protocol should handle this problem isn't
> > strong enough to know if there is a better solution than simply
> > increasing the command ring default timeout.
>
> Are these problematic devices USB 2 or USB 3 devices?

Both.

> You could try playing with the Address device command BSR (block set
> address request) flag and see if it helps.
> Xhci has two ways to get a slot from the Enabled to the Addressed state.
>
> option 1: move slot from Enabled state to Addressed in one go:
> Enabled --(Addr dev cmd, BSR=0)--> Addressed
>
> option 2: move from Enabled state via Default state to Addressed state:
> Enabled --(Addr dev cmd, BSR=1)--> Default --(Addr dev cmd, BSR=0)--> Addressed
>
> I think the usb core "old_scheme_first" module parameter will end up doing this.

Apologies for taking so long to respond to this; I've been a little busy
this week. I tried setting old_scheme_first and it didn't have any effect.
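
For anyone following along, the two addressing paths quoted above can be written down as a small transition table. This is only an illustrative Python sketch, not driver code: the names `TRANSITIONS`, `address_device` and `run` are made up here, and only the three transitions mentioned in the quoted text are modelled. Anything else is treated as an error for simplicity.

```python
# Illustrative model of the slot-state transitions quoted above.
# Not kernel code; every name here is hypothetical.

# (current slot state, BSR flag) -> next slot state for one Address Device command
TRANSITIONS = {
    ("Enabled", 0): "Addressed",  # option 1: Enabled to Addressed in one go
    ("Enabled", 1): "Default",    # option 2, first command
    ("Default", 0): "Addressed",  # option 2, second command
}


def address_device(state, bsr):
    """Return the slot state after one Address Device command."""
    try:
        return TRANSITIONS[(state, bsr)]
    except KeyError:
        # Simplification: transitions not named in the thread are rejected.
        raise ValueError(f"transition not covered by this sketch: {state}, BSR={bsr}")


def run(bsr_sequence, state="Enabled"):
    """Apply a sequence of Address Device commands; return the states visited."""
    path = [state]
    for bsr in bsr_sequence:
        state = address_device(state, bsr)
        path.append(state)
    return path


option1 = run([0])      # single command with BSR=0
option2 = run([1, 0])   # BSR=1 first, then BSR=0

print(" -> ".join(option1))
print(" -> ".join(option2))
```

The point of option 2 is that the slot reaches the Default state with a first command and is moved to Addressed by a second one, instead of everything happening in a single command.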

Here's the kernel log without my patch to track command ring TRB completion
times (as well as extra debug disabled).

kernel: usb 3-2.1: new high-speed USB device number 4 using xhci_hcd
kernel: usb 3-2.1: New USB device found, idVendor=058f, idProduct=6387, bcdDevice= 1.03
kernel: usb 3-2.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
kernel: usb 3-2.1: Product: Mass Storage
kernel: usb 3-2.1: Manufacturer: Generic
kernel: usb 3-2.1: SerialNumber: B1EC2EB2
kernel: usb 3-2.1: USB disconnect, device number 4
kernel: usb 3-2.1: new high-speed USB device number 5 using xhci_hcd
kernel: xhci_hcd 0002:01:00.0: Abort failed to stop command ring: -110
kernel: xhci_hcd 0002:01:00.0: xHCI host controller not responding, assume dead
kernel: xhci_hcd 0002:01:00.0: HC died; cleaning up
kernel: xhci_hcd 0002:01:00.0: Timeout while waiting for setup device command
kernel: usb 3-1: USB disconnect, device number 2
kernel: usb 3-2: USB disconnect, device number 3
kernel: usb 4-1: USB disconnect, device number 2
kernel: usb 4-2: USB disconnect, device number 3
kernel: usb 3-2.1: device not accepting address 5, error -108
kernel: usb 3-2-port1: couldn't allocate usb_device

If I push XHCI_CMD_DEFAULT_TIMEOUT beyond 9.6 seconds, the HC will continue
to function normally.

From a quick web search, I can see that other people are experiencing the
same issue. None of those threads offer any solutions. Many seem to revolve
around disabling usb power management, and this did not help in our case.

I wish I could gain some insight on how the hardware is handling this edge
case.

> -Mathias
>
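
To make the timing in this thread concrete, here is a rough Python sketch of the arithmetic. It is a simplified model, not the driver's logic: the 5 second timeout and the 9.6 second completion time are the numbers reported above, while `ABORT_WAIT_MS` is a hypothetical placeholder for how long the driver waits for the command ring to stop (the real value is not given in this thread), and the function `outcome` is made up for illustration.

```python
# Timeline sketch using the numbers from this thread, in milliseconds.
CMD_TIMEOUT_MS = 5000      # command ring timeout reported above
TRB_COMPLETION_MS = 9600   # worst-case Address Device completion reported above
ABORT_WAIT_MS = 2000       # HYPOTHETICAL placeholder, not taken from the driver


def outcome(completion_ms, cmd_timeout_ms, abort_wait_ms):
    """Classify what happens to one command under this simplified model."""
    if completion_ms <= cmd_timeout_ms:
        return "completed"            # TRB finished before the timeout fired
    # After the timeout the driver tries to stop the ring, but the
    # controller stays busy until the TRB actually completes.
    busy_after_timeout_ms = completion_ms - cmd_timeout_ms
    if busy_after_timeout_ms <= abort_wait_ms:
        return "aborted"              # ring stopped within the abort wait
    return "hc_died"                  # abort timed out, host assumed dead


busy_ms = TRB_COMPLETION_MS - CMD_TIMEOUT_MS
current = outcome(TRB_COMPLETION_MS, CMD_TIMEOUT_MS, ABORT_WAIT_MS)
raised = outcome(TRB_COMPLETION_MS, 10000, ABORT_WAIT_MS)

print(f"busy after timeout: {busy_ms} ms")
print(f"timeout 5000 ms  -> {current}")
print(f"timeout 10000 ms -> {raised}")
```

Under these assumptions the controller is still busy for 4600 ms after the timeout fires, which matches the 4.6 seconds seen when stopping the ring, and raising the timeout past the completion time avoids the abort path altogether.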