Re: xhci_hcd 0000:11:00.0: HW died, polling stopped.

Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx> · Wed, 01 May 2013 23:28:45 +0200

Alan Stern wrote:
> On Wed, 1 May 2013, Martin Mokrejs wrote:
> 
>> Sarah Sharp wrote:
>>> The "HW died, polling stopped" message is harmless.  It happens when the
>>> xHCI host goes into a PCI low power state (D3).  When the PCI host goes
>>> into D3cold, the registers will read as all Fs, and the polling loop
>>> will mistakenly believe the hardware has been removed.  However, this
>>> bug only affects the debug code.  It does not affect any other part of
>>> the xHCI driver.
>>
>> I do not mind that it affects just the XHCI_DEBUG stuff. I am only pointing
>> at "those" places in the source code where something else *could* happen:
>> detection of silently ejected or dead hardware.
>>
>> I really did unplug the express card providing the second USB3.0 controller
>> (11:00). My point was that although pciehp did not propagate the hot eject
>> to downstream drivers (xhci_hcd), I believe xhci_hcd could have noticed it
>> by itself because it polls from time to time, and this code, albeit
>> debugging code, shows roughly where something more clever could happen.
>> Ideally, in place of the "HC error bitmask = 0x4" message (due to un-notified
>> hot removal), or at least when "HW died, polling stopped" was printed
>> (un-notified hot-reinsert), xhci_hcd could realize the device is gone.
> 
> That's not how drivers work in Linux.  They don't unbind all by 
> themselves; they wait until the bus-level code tells them to unbind.
> xhci-hcd is not alone in this respect; all the drivers behave this way.

I don't believe that. In my tests only the USB3 express card suffered from
"the problem", unlike the firewire_ohci- and sata_sil24-based cards.

Do you remember the thread https://lkml.org/lkml/2012/4/16/566
... where a timeout of about 60 sec was needed to get USB working again?
I think I have since seen others talking about a 30 sec delay, but I believe
this would all be easier if xhci_hcd unbound itself from a dead device.

I am naively thinking that PCI has no way to detect that a card was hot-unplugged
if, e.g., hotplug was left out of the kernel .config completely or when
acpiphp/pciehp don't work, for whatever reason. But xhci_hcd has the unique
advantage that it polls and so knows the device is dead. The same probably
applies to uhci/ehci. I just don't see why an upper level that notices a
problem could not act on it.

Other drivers probably don't do polling, by design, so they are in another
situation.
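
To illustrate the kind of check I mean, here is a rough userspace sketch in
Python, not kernel code and not what xhci_hcd does today. It assumes the usual
sysfs layout, where each PCI device directory has a "config" file whose first
two bytes are the vendor ID, and treats an all-ones vendor ID as "device gone".
The function names are made up for this example. Note that, as Sarah says,
a device in D3cold can also read as all Fs, so this check may give false
positives.

```python
#!/usr/bin/env python3
"""Sketch: remove a PCI device whose config space reads back as all ones."""
import os
import sys


def vendor_id(dev_dir):
    """Return the 16-bit vendor ID stored in the first two config bytes."""
    with open(os.path.join(dev_dir, "config"), "rb") as f:
        raw = f.read(2)
    if len(raw) < 2:
        # A short read is treated the same as an absent device (assumption).
        return 0xFFFF
    return int.from_bytes(raw, "little")


def looks_gone(dev_dir):
    """True if the vendor ID is 0xFFFF, i.e. nothing answers on the bus."""
    return vendor_id(dev_dir) == 0xFFFF


def remove_if_gone(dev_dir):
    """Write 1 to the device's 'remove' file if it looks gone.

    Returns True when a removal was requested, False otherwise.
    """
    if not looks_gone(dev_dir):
        return False
    with open(os.path.join(dev_dir, "remove"), "w") as f:
        f.write("1\n")
    return True


if __name__ == "__main__":
    # Each argument is a device directory under /sys/bus/pci/devices/.
    for path in sys.argv[1:]:
        print(path, "removed" if remove_if_gone(path) else "still present")
```

Run as root with /sys/bus/pci/devices/0000:11:00.0 as the argument, this would
do the same as the manual "echo 1 > .../remove", but only when the card no
longer answers.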

> 
>> So what can be done so that the user does not have to run 
>>
>> echo 1 > /sys/bus/pci/devices/0000:11:00.0/remove
>>
>> manually? Couldn't xhci_hcd detect somehow that the device is dead or ejected?
> 
> It could detect that the device is dead.  In fact, it probably detects 
> that now.  But even if it could tell that the device had been ejected, 
> it would not unbind itself.
> 
> What can be done is to fix the PCIe core code so that it correctly
> realizes when an eject takes place.

I believe that will get fixed eventually, as I found that pciehp is broken
by pcie_aspm=off whereas it works with pcie_aspm=native.
That in turn points to bad ASPM L0s/L1 handling and seems similar to issues
others had with PCIe LnkCtl on iwlwifi. That is somehow related to those
OSC_ trickeries in acpi. Finally, it seems others have hit ASPM issues with
Dell Vostro laptops. :( This will all hopefully get fixed. But I want a USB
fix as well. ;-)

Martin