Re: XHCI abort CMD failure

Mathias Nyman <mathias.nyman@xxxxxxxxx> · Wed, 27 Feb 2019 09:31:39 +0200

Hi

On 26.2.2019 19.55, Shah, Nehal-bakulchandra wrote:
Hi

On one of our customer platforms, we are getting the following errors:

[65136.606651] xhci_hcd 0000:00:10.0: Command timeout
[65136.606690] xhci_hcd 0000:00:10.0: Abort command ring
[65150.739738] xhci_hcd 0000:00:10.0: Abort failed to stop command ring: -110
[65150.740115] xhci_hcd 0000:00:10.0: // Halt the HC
[65150.785382] xhci_hcd 0000:00:10.0: Host halt failed, -110
[65150.785419] xhci_hcd 0000:00:10.0: xHCI host controller not responding, assume dead
[65150.785874] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 1, ep index 0
[65150.785882] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 1, ep index 2
[65150.785911] xhci_hcd 0000:00:10.0: xHCI dying, ignoring interrupt. Shouldn't IRQs be disabled?
[65150.785921] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 2, ep index 0
[65150.785927] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 2, ep index 2
[65150.785937] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 3, ep index 0
[65150.785943] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 3, ep index 2
[65150.785971] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 3, ep index 3
[65150.785978] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 3, ep index 6
[65150.785987] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 4, ep index 0
[65150.785993] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 4, ep index 2
[65150.786003] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 4, ep index 4
[65150.786012] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 5, ep index 0
[65150.786018] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 5, ep index 2
[65150.786027] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 6, ep index 0
[65150.786033] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 6, ep index 2
[65150.786039] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 6, ep index 3
[65150.786046] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 7, ep index 0
[65150.786052] xhci_hcd 0000:00:10.0: Killing URBs for slot ID 8, ep index 0
[65150.786059] xhci_hcd 0000:00:10.0: HC died; cleaning up
[65150.786597] xhci_hcd 0000:00:10.0: Timeout while waiting for setup device command

As I understand it, the abort command times out because CRR (Command Ring Running) is never cleared, so the driver assumes the controller is dead. After this,
the host is left in a completely broken state. What can the recovery mechanism be? The comment in the xhci_abort_cmd_ring() function says: "In the future we should distinguish between -ENODEV and -ETIMEDOUT and try to recover a -ETIMEDOUT with a host controller reset."

What kernel version is this issue seen on?
I recall there being some race issue in this area some time ago.

Would it be a good idea to reset the controller, or is there any other suggestion for recovery? The current situation requires rebooting the system.

Yes, I think it would be a good idea to try resetting the host in the -ETIMEDOUT case.
So far, the most common cause of a command timing out, followed by the command ring
abort also timing out, has been that the host controller was actually removed (PCI hotplug),
so just tearing down the host has been enough.

Now we just need to implement this :)
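
To make the idea concrete, here is a minimal userspace sketch in C of the decision being discussed. It is not driver code and not a proposed patch. The names poll_register(), pick_recovery(), the recovery_action enum and the read_* fake registers are all hypothetical and exist only for illustration. The sketch assumes that an all-ones register read means the device is gone (the usual PCI behaviour after removal) and that CRR is bit 3 of the CRCR register. On Linux, ETIMEDOUT is 110, which matches the -110 in the log above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CMD_RING_RUNNING (1u << 3)   /* CRR bit in CRCR (assumed layout) */

/* Hypothetical recovery actions; these names are not from the xhci driver. */
enum recovery_action {
    RECOVERY_NONE,      /* abort succeeded, nothing more to do */
    RECOVERY_TEARDOWN,  /* host is gone (e.g. PCI hot-unplug): tear it down */
    RECOVERY_RESET,     /* host is present but stuck: try a controller reset */
};

/* Stand-in for an MMIO register read. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll a register until (value & mask) == done, at most max_polls times.
 * Returns 0 on success, -ENODEV if the register reads back as all ones,
 * and -ETIMEDOUT if the condition never becomes true.
 */
int poll_register(read_reg_fn read_reg, void *ctx,
                  uint32_t mask, uint32_t done, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        uint32_t val = read_reg(ctx);

        if (val == UINT32_MAX)
            return -ENODEV;
        if ((val & mask) == done)
            return 0;
    }
    return -ETIMEDOUT;
}

/* Map the result of the abort attempt to a recovery action. */
enum recovery_action pick_recovery(int abort_ret)
{
    switch (abort_ret) {
    case 0:
        return RECOVERY_NONE;
    case -ETIMEDOUT:
        return RECOVERY_RESET;
    case -ENODEV:
    default:
        /* Unknown errors are treated like a removed host. */
        return RECOVERY_TEARDOWN;
    }
}

/* Fake registers for trying the logic out. */

/* CRR never clears: models the stuck controller from the log. */
uint32_t read_stuck(void *ctx)
{
    (void)ctx;
    return CMD_RING_RUNNING;
}

/* Reads return all ones: models a hot-unplugged controller. */
uint32_t read_removed(void *ctx)
{
    (void)ctx;
    return UINT32_MAX;
}

/* CRR clears after *ctx more polls: models a successful abort. */
uint32_t read_stops_later(void *ctx)
{
    int *polls_left = ctx;

    if (*polls_left > 0) {
        (*polls_left)--;
        return CMD_RING_RUNNING;
    }
    return 0;
}
```

With these fakes, pick_recovery(poll_register(read_stuck, NULL, CMD_RING_RUNNING, 0, 10)) selects RECOVERY_RESET, while the same call with read_removed selects RECOVERY_TEARDOWN. A real implementation would also need to decide what to do if the reset itself fails.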

-Mathias