Re: xhci_hcd 0000:11:00.0: HW died, polling stopped.

Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx> · Wed, 01 May 2013 03:03:31 +0200

Sarah Sharp wrote:
> The "HW died, polling stopped" message is harmless.  It happens when the
> xHCI host goes into a PCI low power state (D3).  When the PCI host goes
> into D3cold, the registers will read as all Fs, and the polling loop
> will mistakenly believe the hardware has been removed.  However, this
> bug only affects the debug code.  It does not affect any other part of
> the xHCI driver.

I do not mind that it affects only the XHCI_DEBUG code. I am merely pointing
at those places in the source code where something more useful *could* happen:
detection of silently ejected or dead hardware.

I really did unplug the express card providing the second USB3.0 controller
(11:00). My point is that although pciehp did not propagate the hot eject
to the downstream driver (xhci_hcd), I believe xhci_hcd could have noticed it
by itself, because it polls the hardware from time to time. This polling code,
albeit debugging code, shows roughly where something more clever could happen.
Ideally at the point where "HC error bitmask = 0x4" was printed (after the
un-notified hot removal), or at least when "HW died, polling stopped" was
printed (after the un-notified hot re-insert), xhci_hcd could realize the
device is gone.

Are we talking about the same thing?

And is the express card with the NEC chip allowed to enter D3cold at all?

pcie_aspm=off:
[    1.680882] pci 0000:00:1c.7: PME# supported from D0 D3hot D3cold
[    1.680888] pci 0000:00:1c.7: PME# disabled
[    1.724471] pci 0000:11:00.0: PME# supported from D0 D3hot
[    1.724481] pci 0000:11:00.0: PME# disabled

pcie_aspm=native:
[    1.681021] pci 0000:00:1c.7: PME# supported from D0 D3hot D3cold
[    1.681027] pci 0000:00:1c.7: PME# disabled
[    1.753353] pci 0000:11:00.0: PME# supported from D0 D3hot
[    1.753363] pci 0000:11:00.0: PME# disabled
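
The comparison above can be done mechanically. The following is only a
sketch: the function names are made up for illustration, and it reads
nothing but the "PME# supported from" lines quoted above. Those lines say
from which states a device can signal PME#; whether that settles the
question of the card being allowed to enter D3cold is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the power states from which the given device
# (domain:bus:dev.fn) reports PME# support, reading kernel log text on stdin.
pme_states() {
    bdf=$1
    sed -n "s/.*pci $bdf: PME# supported from //p"
}

# Hypothetical helper: succeed if D3cold is among those states.
pme_from_d3cold() {
    pme_states "$1" | grep -q 'D3cold'
}
```

Typical use would be "dmesg | pme_states 0000:11:00.0", which for the logs
above prints "D0 D3hot" for the card and "D0 D3hot D3cold" for 0000:00:1c.7.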

> Please disregard the "HW died, polling stopped" messages in dmesg.

So what can be done so that the user does not have to run 

echo 1 > /sys/bus/pci/devices/0000:11:00.0/remove

manually? Couldn't xhci_hcd somehow detect that the device is dead or ejected?
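
Until the driver handles this itself, a user-space stopgap could look
roughly like the sketch below. It rests on assumptions that have not been
tested here: that reading the device's sysfs "config" file reaches the
hardware, and that an absent card makes the vendor and device IDs read back
as all Fs, the same pattern as in "op reg status = 0xffffffff". A device in
D3cold may read the same way, so the check could misfire. The function names
are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first four bytes of config space
# (vendor ID and device ID) read back as all ones, i.e. nothing answers.
# $1 is the device directory, e.g. /sys/bus/pci/devices/0000:11:00.0
pci_dev_looks_gone() {
    dev_dir=$1
    id=$(od -An -tx1 -N4 "$dev_dir/config" 2>/dev/null | tr -d ' \n')
    [ "$id" = "ffffffff" ]
}

# Hypothetical helper: if the device looks gone, ask the PCI core to drop
# it, which is the same write as the manual "echo 1 > .../remove" above.
remove_if_gone() {
    dev_dir=$1
    if pci_dev_looks_gone "$dev_dir"; then
        echo 1 > "$dev_dir/remove"
    fi
}
```

Run as root, for example "remove_if_gone /sys/bus/pci/devices/0000:11:00.0".
This only automates the manual workaround; it does not replace detection in
xhci_hcd itself.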

> 
> Sarah Sharp

Thank you,
Martin

> 
> On Wed, May 01, 2013 at 01:07:48AM +0200, Martin Mokrejs wrote:
>> Hi,
>>   I just tried 3.9 kernel with pcie_aspm=off and in another attempt with pcie_aspm=native.
>> I realized the message "HW died" happens only in the former case.
>>
>>   I believe this is a bug. If I unplug an express card with a NEC-based USB3 host
>> it should be properly terminated, and xhci_hcd should unbind *even* when
>> "HW died" happened. It is not the case now so I have to do:
>>
>> echo 1 > /sys/bus/pci/devices/0000:11:00.0/remove
>>
>> to get rid of the stale 11:00 device from my system (sysfs entries):
>>
>> /proc/iomem
>>        f1104000-f1104fff : r8169
>>    f6800000-f6bfffff : 0000:00:02.0
>>    f6c00000-f7cfffff : PCI Bus 0000:11
>> -    f6c00000-f6c01fff : 0000:11:00.0
>> -      f6c00000-f6c01fff : xhci_hcd
>>    f7d00000-f7dfffff : PCI Bus 0000:0b
>>      f7d00000-f7d0ffff : 0000:0b:00.0
>>        f7d00000-f7d0ffff : xhci_hcd
>>
>>
>> /proc/interrupts:
>> - 45:          1          0   PCI-MSI-edge      xhci_hcd
>> - 46:          0          0   PCI-MSI-edge      xhci_hcd
>> - 47:          0          0   PCI-MSI-edge      xhci_hcd
>>
>>
>>
>> With pcie_aspm=off, the first hot eject of the express card
>> with the USB3.0 controller does not result in "HW died" but in "HC error bitmask = 0x4",
>> whatever that means. That is because pciehp is broken under pcie_aspm=off
>> (unlike under pcie_aspm=native), but that is not a matter for linux-usb.
>>
>> [   62.960729] xhci_hcd 0000:0b:00.0: Poll event ring: 4294943584
>> [   62.960732] xhci_hcd 0000:11:00.0: Poll event ring: 4294943584
>> [   62.960757] xhci_hcd 0000:11:00.0: op reg status = 0x0
>> [   62.960763] xhci_hcd 0000:11:00.0: ir_set 0 pending = 0x2
>> [   62.960764] xhci_hcd 0000:11:00.0: HC error bitmask = 0x4
>> [   62.960765] xhci_hcd 0000:11:00.0: Event ring:
>> [   62.960768] xhci_hcd 0000:11:00.0: @00000000d6020400 d6020000 00000000 01003028 0000c001
>> [   62.960769] xhci_hcd 0000:0b:00.0: op reg status = 0x0
>> [   62.960771] xhci_hcd 0000:11:00.0: @00000000d6020410 00000000 00000000 00000000 00000000
>> [   62.960772] xhci_hcd 0000:11:00.0: @00000000d6020420 00000000 00000000 00000000 00000000
>> [   62.960773] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
>> [   62.960775] xhci_hcd 0000:11:00.0: @00000000d6020430 00000000 00000000 00000000 00000000
>> [   62.960776] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
>> [   62.960777] xhci_hcd 0000:11:00.0: @00000000d6020440 00000000 00000000 00000000 00000000
>>
>> The kernel is still looking for the device, which is silly because the device has already been
>> ejected from the express card slot:
>>
>> +[   62.961160] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr low bits + flags = @00000008
>> +[   62.961161] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr high bits = @00000000
>>
>> A subsequent hot re-insert of the card goes unnoticed by pciehp (due to a bug caused by pcie_aspm=off)
>> and therefore xhci_hcd is puzzled and spits out:
>>
>> +[  123.191537] xhci_hcd 0000:0b:00.0: Poll event ring: 4294949600
>> +[  123.191547] xhci_hcd 0000:11:00.0: Poll event ring: 4294949600
>> +[  123.191557] xhci_hcd 0000:11:00.0: op reg status = 0xffffffff
>> +[  123.191563] xhci_hcd 0000:0b:00.0: op reg status = 0x0
>> +[  123.191570] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
>> +[  123.191574] xhci_hcd 0000:11:00.0: HW died, polling stopped.
>> +[  123.191580] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
>>
>> At this step xhci_hcd should unbind the dead device so that its sysfs entries can be removed
>> (both iomem and interrupts). If that does not happen, or is not done manually, a subsequent
>> hot insert has no chance to succeed: it proceeds silently, but the device is left unconfigured
>> and the sysfs entries show only stale cached values. This can be demonstrated when a desperate user
>> inserts a different express card (a mixture of both cards is shown in the lspci entries but only the old
>> data in the sysfs entries). Let's clean up the mess and ensure xhci_hcd releases the resources allocated
>> by the dead device.
>>
>> I speculate that the "HC error bitmask = 0x4" case should result in "HW died" as well.
>>
>>
>> Thank you,
>> Martin
>> P.S.: Collected dmesg/lspci/iomem/interrupts data are at: http://195.113.57.32/~mmokrejs/tmp/20130430.tar.bz2
>> in 3.9/off subdirectory (the pcie_aspm=off case). The working pcie_aspm=native behavior is documented
>> under 3.9/native subdirectory.
>>
> 
> 