Re: xhci_hcd 0000:11:00.0: HW died, polling stopped.

Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> · Tue, 30 Apr 2013 17:36:17 -0700

The "HW died, polling stopped" message is harmless.  It happens when the
xHCI host goes into a PCI low power state (D3).  When the host controller
goes into D3cold, its registers read as all Fs, and the polling loop
mistakenly concludes that the hardware has been removed.  However, this
bug only affects the debug code.  It does not affect any other part of
the xHCI driver.

Please disregard the "HW died, polling stopped" messages in dmesg.
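
For illustration only, the decision that produces this message can be
modelled as below.  This is a simplified Python sketch, not the driver's
code; the names ALL_ONES and classify_status are made up for the example.
It shows why an all-Fs read from a controller in D3cold looks the same as
a read from a controller that has been removed.

```python
# Value a 32-bit register read returns when the device does not respond,
# for example because it was removed or is powered down in D3cold.
ALL_ONES = 0xFFFFFFFF


def classify_status(status: int) -> str:
    """Model the debug polling loop's decision for one status read.

    Hypothetical helper: it only mirrors the messages seen in dmesg.
    """
    if (status & ALL_ONES) == ALL_ONES:
        # Indistinguishable cases: device gone, or device in D3cold.
        return "HW died, polling stopped."
    return "op reg status = 0x%x" % status
```

A read of 0x0 is reported as a normal status line, while 0xffffffff ends
the polling regardless of whether the card is actually gone.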

Sarah Sharp

On Wed, May 01, 2013 at 01:07:48AM +0200, Martin Mokrejs wrote:
> Hi,
>   I just tried 3.9 kernel with pcie_aspm=off and in another attempt with pcie_aspm=native.
> I realized the message "HW died" happens only in the former case.
> 
>   I believe this is a bug. If I unplug an express card with a NEC-based USB3 host
> it should be properly terminated, and xhci_hcd should unbind *even* when
> "HW died" happened. It is not the case now so I have to do:
> 
> echo 1 > /sys/bus/pci/devices/0000:11:00.0/remove
> 
> to get rid of the stale 11:00 device from my system (sysfs entries):
> 
> /proc/iomem
>        f1104000-f1104fff : r8169
>    f6800000-f6bfffff : 0000:00:02.0
>    f6c00000-f7cfffff : PCI Bus 0000:11
> -    f6c00000-f6c01fff : 0000:11:00.0
> -      f6c00000-f6c01fff : xhci_hcd
>    f7d00000-f7dfffff : PCI Bus 0000:0b
>      f7d00000-f7d0ffff : 0000:0b:00.0
>        f7d00000-f7d0ffff : xhci_hcd
> 
> 
> /proc/interrupts:
> - 45:          1          0   PCI-MSI-edge      xhci_hcd
> - 46:          0          0   PCI-MSI-edge      xhci_hcd
> - 47:          0          0   PCI-MSI-edge      xhci_hcd
> 
> 
> 
> Let's say that when pcie_aspm=off the first hot eject of the express card
> with the USB3.0 controller does not result in "HW died" but in "HC error bitmask = 0x4",
> whatever that means. That is because pciehp is broken under pcie_aspm=off
> (unlike under pcie_aspm=native), but that is not a story for linux-usb.
> 
> [   62.960729] xhci_hcd 0000:0b:00.0: Poll event ring: 4294943584
> [   62.960732] xhci_hcd 0000:11:00.0: Poll event ring: 4294943584
> [   62.960757] xhci_hcd 0000:11:00.0: op reg status = 0x0
> [   62.960763] xhci_hcd 0000:11:00.0: ir_set 0 pending = 0x2
> [   62.960764] xhci_hcd 0000:11:00.0: HC error bitmask = 0x4
> [   62.960765] xhci_hcd 0000:11:00.0: Event ring:
> [   62.960768] xhci_hcd 0000:11:00.0: @00000000d6020400 d6020000 00000000 01003028 0000c001
> [   62.960769] xhci_hcd 0000:0b:00.0: op reg status = 0x0
> [   62.960771] xhci_hcd 0000:11:00.0: @00000000d6020410 00000000 00000000 00000000 00000000
> [   62.960772] xhci_hcd 0000:11:00.0: @00000000d6020420 00000000 00000000 00000000 00000000
> [   62.960773] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
> [   62.960775] xhci_hcd 0000:11:00.0: @00000000d6020430 00000000 00000000 00000000 00000000
> [   62.960776] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
> [   62.960777] xhci_hcd 0000:11:00.0: @00000000d6020440 00000000 00000000 00000000 00000000
> 
> The kernel is still looking for the device, which is silly because the device has already been
> ejected from the express card slot:
> 
> +[   62.961160] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr low bits + flags = @00000008
> +[   62.961161] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr high bits = @00000000
> 
> A subsequent hot re-insert of the card goes unnoticed by pciehp (due to a bug caused by pcie_aspm=off)
> and therefore xhci_hcd is puzzled and spits out:
> 
> +[  123.191537] xhci_hcd 0000:0b:00.0: Poll event ring: 4294949600
> +[  123.191547] xhci_hcd 0000:11:00.0: Poll event ring: 4294949600
> +[  123.191557] xhci_hcd 0000:11:00.0: op reg status = 0xffffffff
> +[  123.191563] xhci_hcd 0000:0b:00.0: op reg status = 0x0
> +[  123.191570] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
> +[  123.191574] xhci_hcd 0000:11:00.0: HW died, polling stopped.
> +[  123.191580] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
> 
> At this step xhci_hcd should unbind the dead device so that its sysfs entries can be removed
> (both iomem and interrupts). If that does not happen, or is not done manually, a subsequent
> hot insert has no chance to succeed: it proceeds silently, but the device is left unconfigured
> and the sysfs entries show just crappy cached values. This can be demonstrated when a desperate user
> inserts a different express card (a mixture of both is shown in lspci entries but only the old
> data in sysfs entries). Let's clean up the mess and ensure xhci_hcd releases resources allocated
> by the dead device.
> 
> I speculate the "HC error bitmask = 0x4" should result in a "HW died" case as well.
> 
> 
> Thank you,
> Martin
> P.S.: Collected dmesg/lspci/iomem/interrupts data are at: http://195.113.57.32/~mmokrejs/tmp/20130430.tar.bz2
> in 3.9/off subdirectory (the pcie_aspm=off case). The working pcie_aspm=native behavior is documented
> under 3.9/native subdirectory.
> 
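
The manual workaround quoted above (finding the controller that logged
"HW died" and writing 1 to its sysfs remove file) could be scripted
roughly as follows.  This is a sketch under assumptions: the helper names
dead_controllers and remove_path are hypothetical, the log format is
taken from the dmesg excerpt in this thread, and the message alone does
not prove the card is gone, since it also appears when the controller is
merely in D3cold.

```python
import re

# Matches lines such as:
# [  123.191574] xhci_hcd 0000:11:00.0: HW died, polling stopped.
DIED_RE = re.compile(r"xhci_hcd (\S+): HW died, polling stopped\.")


def dead_controllers(dmesg_text):
    """Return PCI addresses that logged "HW died", in first-seen order."""
    seen = []
    for line in dmesg_text.splitlines():
        match = DIED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def remove_path(pci_address):
    """Build the sysfs remove file path used in the quoted workaround."""
    return "/sys/bus/pci/devices/%s/remove" % pci_address
```

The functions only report candidates and paths; writing "1" to the
returned path is left to the administrator, after confirming that the
card really has been ejected.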