On 4/14/2019 10:59 PM, Lukas Wunner wrote:
> On Sun, Apr 14, 2019 at 09:36:41PM +0200, Lukas Wunner wrote:
>> I suppose this can happen if a write to the Slot Control register is
>> performed while HPIE and/or CCIE is disabled, the two notifications
>> are subsequently enabled and another write to the Slot Control is
>> performed. That second write will then call wait_event_timeout()
>> because of the stale ctrl->cmd_busy == 1, but the Command Complete
>> notification has already happened and was cleared by pcie_poll_cmd(),
>> hence it times out.
>>
>> Sounds reasonable, I'm a little surprised though that I've never seen
>> this myself. I guess we've been doing this wrong for years, so:
> On second thought, it's not surprising at all that I never saw this
> because Thunderbolt sets NoCompl+, so doesn't use Command Complete
> notifications.
>
> I suspect that even though we've been doing this wrong for a long time,
> the bug was exposed by a recent change to pciehp. Do you happen to
> know since which kernel version or commit you've been witnessing the
> timeouts?

Hi Lukas, thank you for your time.

We started seeing these timeouts when we went to 4.20.5 from 4.14.61.

In pcie_init(), there's a check that turns off a slot if it's powered on
but unoccupied. Before 4e6a13356f1c ("PCI: pciehp: Deduplicate presence
check on probe & resume"), that power check was at the end of
pcie_probe(), after the IRQ is requested.

I've investigated a little and found that the delays go away if the
power check is moved back where it was before that commit.

- Spencer
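
For anyone following along, here is a rough user-space sketch of the sequence described in the quoted text. It is a toy model and not pciehp code: every toy_* name is made up for illustration, the three booleans only stand in for ctrl->cmd_busy, the HPIE/CCIE bits and the Command Completed status bit, and enabling notifications is simplified to a plain flag flip. The clear_busy parameter is also hypothetical; it only lets the two behaviours be compared side by side.

```c
#include <stdbool.h>

/* Toy model only; names are hypothetical and loosely follow the thread. */
struct toy_ctrl {
    bool cmd_busy;        /* software flag: a command is still outstanding */
    bool notify_enabled;  /* stands in for HPIE and CCIE both being set */
    bool hw_cmd_complete; /* stands in for the Command Completed status bit */
};

/* The hardware finishes the outstanding command and sets the status bit. */
static void toy_hw_complete(struct toy_ctrl *c)
{
    c->hw_cmd_complete = true;
}

/* Interrupt path: runs only when notifications are enabled and the
 * status bit is set; it consumes the bit and clears the busy flag. */
static void toy_isr(struct toy_ctrl *c)
{
    if (c->notify_enabled && c->hw_cmd_complete) {
        c->hw_cmd_complete = false;
        c->cmd_busy = false;
    }
}

/* Polling path: consumes the status bit. With clear_busy == false the
 * busy flag is left set, which is the stale state described above. */
static bool toy_poll(struct toy_ctrl *c, bool clear_busy)
{
    if (!c->hw_cmd_complete)
        return false;
    c->hw_cmd_complete = false;
    if (clear_busy)
        c->cmd_busy = false;
    return true;
}

/* Wait for the previous command. Returns false to model a timeout. */
static bool toy_wait(struct toy_ctrl *c, bool clear_busy)
{
    if (!c->cmd_busy)
        return true;
    if (c->notify_enabled) {
        toy_isr(c);          /* give the interrupt a chance to fire */
        return !c->cmd_busy; /* still busy: nothing will ever wake us */
    }
    return toy_poll(c, clear_busy);
}

/* Issue a command: wait for the previous one, then mark a new one busy. */
static bool toy_write(struct toy_ctrl *c, bool clear_busy)
{
    bool ok = toy_wait(c, clear_busy);
    c->cmd_busy = true;
    return ok;
}

/* Replay the sequence; returns true if the second write did not time out. */
static bool toy_scenario(bool clear_busy)
{
    struct toy_ctrl c = { false, false, false };

    toy_write(&c, clear_busy);  /* write #1, notifications disabled */
    toy_hw_complete(&c);        /* command completes, no interrupt */
    toy_poll(&c, clear_busy);   /* completion is consumed by polling */
    c.notify_enabled = true;    /* simplification: a plain flag flip */
    return toy_write(&c, clear_busy); /* write #2 waits on cmd_busy */
}
```

In this model, toy_scenario(false) returns false (the second write "times out" because the busy flag is stale and the status bit is already gone), while toy_scenario(true) returns true. That only illustrates the reasoning in the thread; it says nothing about how the real driver should be changed.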