xhci_hcd 0000:11:00.0: HW died, polling stopped.

Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx> · Wed, 01 May 2013 01:07:48 +0200

Hi,
  I just tried the 3.9 kernel with pcie_aspm=off and, in another attempt, with pcie_aspm=native.
The "HW died" message appears only in the former case.

  I believe this is a bug. If I unplug an express card carrying a NEC-based USB3 host
controller, the device should be torn down properly, and xhci_hcd should unbind *even* when
"HW died" was reported. That is not the case now, so I have to do:

echo 1 > /sys/bus/pci/devices/0000:11:00.0/remove

to get rid of the stale 11:00 device from my system (sysfs entries):

/proc/iomem
       f1104000-f1104fff : r8169
   f6800000-f6bfffff : 0000:00:02.0
   f6c00000-f7cfffff : PCI Bus 0000:11
-    f6c00000-f6c01fff : 0000:11:00.0
-      f6c00000-f6c01fff : xhci_hcd
   f7d00000-f7dfffff : PCI Bus 0000:0b
     f7d00000-f7d0ffff : 0000:0b:00.0
       f7d00000-f7d0ffff : xhci_hcd

/proc/interrupts:
- 45:          1          0   PCI-MSI-edge      xhci_hcd
- 46:          0          0   PCI-MSI-edge      xhci_hcd
- 47:          0          0   PCI-MSI-edge      xhci_hcd
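
A minimal sketch of how one might check for such leftovers, assuming the usual
"range : name" layout of /proc/iomem with children indented under their parent.
The function name find_pci_resources and the SAMPLE text are illustrative only;
on a live system the text would come from reading /proc/iomem.

```python
def find_pci_resources(iomem_text, bdf):
    """Return (range, name) pairs for a PCI device and the regions nested under it."""
    results = []
    device_indent = None  # indentation of the matching device line, if inside one
    for line in iomem_text.splitlines():
        if " : " not in line:
            continue
        indent = len(line) - len(line.lstrip())
        addr_range, name = (part.strip() for part in line.split(" : ", 1))
        if name == bdf:
            device_indent = indent
            results.append((addr_range, name))
        elif device_indent is not None and indent > device_indent:
            # Deeper indentation means the region is claimed below the device.
            results.append((addr_range, name))
        else:
            device_indent = None
    return results


# Illustrative input modelled on the listing above (without the diff markers).
SAMPLE = """\
  f6c00000-f7cfffff : PCI Bus 0000:11
    f6c00000-f6c01fff : 0000:11:00.0
      f6c00000-f6c01fff : xhci_hcd
  f7d00000-f7dfffff : PCI Bus 0000:0b
    f7d00000-f7d0ffff : 0000:0b:00.0
      f7d00000-f7d0ffff : xhci_hcd
"""

stale = find_pci_resources(SAMPLE, "0000:11:00.0")
```

A non-empty result for 0000:11:00.0 after the card has been pulled would point
at the stale entries described here.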

With pcie_aspm=off, the first hot eject of the express card with the USB3.0
controller does not result in "HW died" but in "HC error bitmask = 0x4",
whatever that means. That is because pciehp is broken under pcie_aspm=off
(unlike under pcie_aspm=native), but that is not a story for linux-usb.

[   62.960729] xhci_hcd 0000:0b:00.0: Poll event ring: 4294943584
[   62.960732] xhci_hcd 0000:11:00.0: Poll event ring: 4294943584
[   62.960757] xhci_hcd 0000:11:00.0: op reg status = 0x0
[   62.960763] xhci_hcd 0000:11:00.0: ir_set 0 pending = 0x2
[   62.960764] xhci_hcd 0000:11:00.0: HC error bitmask = 0x4
[   62.960765] xhci_hcd 0000:11:00.0: Event ring:
[   62.960768] xhci_hcd 0000:11:00.0: @00000000d6020400 d6020000 00000000 01003028 0000c001
[   62.960769] xhci_hcd 0000:0b:00.0: op reg status = 0x0
[   62.960771] xhci_hcd 0000:11:00.0: @00000000d6020410 00000000 00000000 00000000 00000000
[   62.960772] xhci_hcd 0000:11:00.0: @00000000d6020420 00000000 00000000 00000000 00000000
[   62.960773] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
[   62.960775] xhci_hcd 0000:11:00.0: @00000000d6020430 00000000 00000000 00000000 00000000
[   62.960776] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
[   62.960777] xhci_hcd 0000:11:00.0: @00000000d6020440 00000000 00000000 00000000 00000000

The kernel is still looking for the device, which is silly, because the device has
already been ejected from the express card slot:

+[   62.961160] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr low bits + flags = @00000008
+[   62.961161] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr high bits = @00000000

A subsequent hot re-insert of the card goes unnoticed by pciehp (due to a bug caused by
pcie_aspm=off), and therefore xhci_hcd is puzzled and spits out:

+[  123.191537] xhci_hcd 0000:0b:00.0: Poll event ring: 4294949600
+[  123.191547] xhci_hcd 0000:11:00.0: Poll event ring: 4294949600
+[  123.191557] xhci_hcd 0000:11:00.0: op reg status = 0xffffffff
+[  123.191563] xhci_hcd 0000:0b:00.0: op reg status = 0x0
+[  123.191570] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
+[  123.191574] xhci_hcd 0000:11:00.0: HW died, polling stopped.
+[  123.191580] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0

At this step xhci_hcd should unbind the dead device so that its sysfs entries can be removed
(both iomem and interrupts). If that does not happen, or is not done manually, a subsequent
hot insert has no chance to succeed: it silently proceeds, but the device is left unconfigured
and the sysfs entries show just stale cached values. This can be demonstrated when a desperate
user inserts a different express card (lspci shows a mixture of both cards, but the sysfs
entries show only the old data). Let's clean up the mess and ensure xhci_hcd releases the
resources allocated by the dead device.
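
Until that happens, the manual cleanup can be scripted. The sketch below simply
mirrors the "echo 1 > .../remove" command from above; the function name
remove_pci_device and the sysfs_root parameter are hypothetical and exist only
so the helper can be tried against a scratch directory. On a real system it
needs root.

```python
import os


def remove_pci_device(bdf, sysfs_root="/sys"):
    """Write "1" to the device's sysfs "remove" file, like the echo command."""
    path = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "remove")
    with open(path, "w") as f:
        f.write("1")
    return path


# Usage on a real system (as root):
#   remove_pci_device("0000:11:00.0")
```

This is only a workaround from userspace; it does not change what xhci_hcd
itself does when the hardware disappears.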

I speculate the "HC error bitmask = 0x4" should result in a "HW died" case as well.
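
Both symptoms can be picked out of dmesg mechanically. The sketch below is a
rough log filter, not kernel code: it flags a controller when the log shows
"op reg status = 0xffffffff" (all ones is what a read typically returns once a
PCI device no longer responds), a "HW died" line, or a non-zero
"HC error bitmask". The name classify_xhci_lines and the labels it returns are
made up for illustration.

```python
import re

# Matches e.g. "xhci_hcd 0000:11:00.0: op reg status = 0xffffffff"
LINE_RE = re.compile(
    r"xhci_hcd (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (?P<msg>.*)$"
)
BITMASK_RE = re.compile(r"HC error bitmask = (0x[0-9a-f]+)")


def classify_xhci_lines(lines):
    """Map each PCI address to the set of suspicious symptoms seen for it."""
    suspects = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        bdf, msg = m.group("bdf"), m.group("msg").strip()
        if msg == "op reg status = 0xffffffff":
            suspects.setdefault(bdf, set()).add("all-ones status")
        elif msg.startswith("HW died"):
            suspects.setdefault(bdf, set()).add("HW died")
        else:
            bm = BITMASK_RE.fullmatch(msg)
            if bm and int(bm.group(1), 16) != 0:
                suspects.setdefault(bdf, set()).add("error bitmask " + bm.group(1))
    return suspects


# Lines taken from the log excerpts above.
LOG = [
    "[   62.960764] xhci_hcd 0000:11:00.0: HC error bitmask = 0x4",
    "[   62.960776] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0",
    "[  123.191557] xhci_hcd 0000:11:00.0: op reg status = 0xffffffff",
    "[  123.191574] xhci_hcd 0000:11:00.0: HW died, polling stopped.",
]
report = classify_xhci_lines(LOG)
```

With these lines only 0000:11:00.0 is reported; the healthy controller at
0000:0b:00.0 is not.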

Thank you,
Martin
P.S.: The collected dmesg/lspci/iomem/interrupts data are at: http://195.113.57.32/~mmokrejs/tmp/20130430.tar.bz2
in the 3.9/off subdirectory (the pcie_aspm=off case). The working pcie_aspm=native behavior is documented
under the 3.9/native subdirectory.
