Re: HC died

Seth Bollinger <seth.boll@xxxxxxxxx> · Thu, 23 Feb 2023 09:31:05 -0600

> > We're experiencing a problem with our devices in the field where our
> > customers attach problematic USB devices that are causing the xhci
> > host controller to shut down with the "HC died; cleaning up" message.
>
> Is this seen only on some specific xHC host controller?

I don't think so.  We've seen this on a PCIe-attached ASMedia ASM3142
and on NXP LS1012A/LS1046A SoC controllers (which I think are dwc3 IP).
Strangely, the timing seems to be much easier to hit on the
PCIe-attached ASM3142.

> > I've narrowed this down to a timeout of the address device TRB on the
> > command ring (currently 5 seconds).  It sometimes takes our hardware
> > 9.6 seconds to complete this TRB.  When the driver is trying to stop
> > the cmd ring, the controller is busy for an additional 4.6 seconds.  This
> > results in the "HC died" message and shutdown of the host controller.
> >
> > If I bump the command ring timeout beyond the max TRB completion time,
> > the host controller continues to be responsive and doesn't need to be
> > shut down.
> >
> > My knowledge of how the usb protocol should handle this problem isn't
> > strong enough to know if there is a better solution than simply
> > increasing the command ring default timeout.
>
> Are these problematic devices USB 2 or USB 3 devices?

Both.

> You could try playing with the Address device command BSR (block set
> address request) flag and see if helps.
> Xhci has two ways to get a slot from the Enabled to the Addressed state.
>
> option 1: move slot from Enabled state to Addressed in one go:
> Enabled --(Addr dev cmd, BSR=0)--> Addressed
>
> option 2: move from Enabled state via Default state to Addressed state:
> Enabled --(Addr dev cmd, BSR=1)--> Default --(Addr dev cmd, BSR=0)--> Addressed
>
> I think the usb core "old_scheme_first" module parameter will end up doing this.
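
To check that I'm reading the two options correctly, here is a toy
model of the slot transitions described above.  This is only a sketch,
not driver code: the enum and the address_device() helper are names I
made up for illustration, and it models nothing beyond the three states
and the BSR flag.

```c
#include <stdbool.h>

/* Toy model of the slot states from the quoted text; not xhci driver code. */
enum slot_state { SLOT_ENABLED, SLOT_DEFAULT, SLOT_ADDRESSED };

/*
 * Apply one Address Device command to a slot (hypothetical helper).
 *   bsr = true : Enabled -> Default (the SET_ADDRESS request is blocked)
 *   bsr = false: Enabled or Default -> Addressed
 * Returns false for any transition that is not part of option 1 or 2.
 */
static bool address_device(enum slot_state *slot, bool bsr)
{
    if (bsr) {
        if (*slot != SLOT_ENABLED)
            return false;
        *slot = SLOT_DEFAULT;
        return true;
    }

    if (*slot == SLOT_ADDRESSED)
        return false;
    *slot = SLOT_ADDRESSED;
    return true;
}
```

Option 1 is a single call with bsr = false on an Enabled slot; option 2
is a call with bsr = true followed by a call with bsr = false.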

Apologies for taking so long to respond to this as I've been a little
busy this week.

I tried setting old_scheme_first and it didn't have any effect.
Here's the kernel log, taken without my patch that tracks command ring
TRB completion times and with the extra debug output disabled.

kernel: usb 3-2.1: new high-speed USB device number 4 using xhci_hcd
kernel: usb 3-2.1: New USB device found, idVendor=058f, idProduct=6387, bcdDevice= 1.03
kernel: usb 3-2.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
kernel: usb 3-2.1: Product: Mass Storage
kernel: usb 3-2.1: Manufacturer: Generic
kernel: usb 3-2.1: SerialNumber: B1EC2EB2
kernel: usb 3-2.1: USB disconnect, device number 4
kernel: usb 3-2.1: new high-speed USB device number 5 using xhci_hcd
kernel: xhci_hcd 0002:01:00.0: Abort failed to stop command ring: -110
kernel: xhci_hcd 0002:01:00.0: xHCI host controller not responding, assume dead
kernel: xhci_hcd 0002:01:00.0: HC died; cleaning up
kernel: xhci_hcd 0002:01:00.0: Timeout while waiting for setup device command
kernel: usb 3-1: USB disconnect, device number 2
kernel: usb 3-2: USB disconnect, device number 3
kernel: usb 4-1: USB disconnect, device number 2
kernel: usb 4-2: USB disconnect, device number 3
kernel: usb 3-2.1: device not accepting address 5, error -108
kernel: usb 3-2-port1: couldn't allocate usb_device

If I push XHCI_CMD_DEFAULT_TIMEOUT beyond 9.6 seconds, the HC will
continue to function normally.
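
The arithmetic behind that, as a small sketch.  The struct and helper
names here are made up for illustration and are not from the driver;
the numbers are the ones I measured above (5 second timeout, 9.6 second
worst-case completion).

```c
/* All times in milliseconds. Hypothetical names, not driver code. */
struct cmd_timing {
    unsigned int timeout_ms;    /* command timeout before the abort starts */
    unsigned int completion_ms; /* time the controller needs for the TRB */
};

/* How long the controller is still busy after the timeout has fired. */
static unsigned int busy_after_timeout_ms(const struct cmd_timing *t)
{
    if (t->completion_ms <= t->timeout_ms)
        return 0;
    return t->completion_ms - t->timeout_ms;
}

/* Nonzero if the command finishes before the timeout, so no abort is tried. */
static int completes_in_time(const struct cmd_timing *t)
{
    return t->completion_ms <= t->timeout_ms;
}
```

With timeout_ms = 5000 and completion_ms = 9600, the controller is still
busy for 4600 ms after the timeout, which matches the extra 4.6 seconds
I see during the abort.  With any timeout above 9600 ms the abort path
is never entered.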

From a quick web search, I can see that other people are experiencing
the same issue.  None of those threads offer any solutions.  Many seem
to revolve around disabling usb power management, and this did not
help in our case.

I wish I could gain some insight into how the hardware is handling
this edge case.

> -Mathias
>