Re: [PATCH] PCI: pciehp: Fix false command timeouts on boot

Spencer Lingard <spencer@xxxxxxxxxxxx> · Tue, 16 Apr 2019 14:32:55 -0400

On 4/14/2019 10:59 PM, Lukas Wunner wrote:
> On Sun, Apr 14, 2019 at 09:36:41PM +0200, Lukas Wunner wrote:
>> I suppose this can happen if a write to the Slot Control register is
>> performed while HPIE and/or CCIE is disabled, the two notifications
>> are subsequently enabled and another write to the Slot Control is
>> performed.  That second write will then call wait_event_timeout()
>> because of the stale ctrl->cmd_busy == 1, but the Command Complete
>> notification has already happened and was cleared by pcie_poll_cmd(),
>> hence it times out.
>>
>> Sounds reasonable, I'm a little surprised though that I've never seen
>> this myself.  I guess we've been doing this wrong for years, so:
> On second thought, it's not surprising at all that I never saw this
> because Thunderbolt sets NoCompl+, so doesn't use Command Complete
> notifications.
>
> I suspect that even though we've been doing this wrong for a long time,
> the bug was exposed by a recent change to pciehp.  Do you happen to
> know since which kernel version or commit you've been witnessing the
> timeouts?

Hi Lukas, thank you for your time.

We started seeing these timeouts when we went to 4.20.5 from 4.14.61.

In pcie_init(), there's a check that turns off a slot if it's powered on
but unoccupied. Before 4e6a13356f1c ("PCI: pciehp: Deduplicate presence
check on probe & resume"), that power check was at the end of
pciehp_probe(), after the IRQ is requested. I've investigated a little and
found that the delays go away if the power check is moved back where it
was before that commit.
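
For reference, here is how I read the sequence you describe, written as
a small user-space toy model. It is only a sketch and not the driver
code: struct toy_ctrl and all toy_* functions are made-up names, the
hardware is assumed to complete every command immediately, and enabling
the notifications is modelled as a plain flag flip rather than a Slot
Control write.

```c
#include <stdbool.h>

/* Toy stand-in for the controller state; not the real struct controller. */
struct toy_ctrl {
	bool irq_enabled;	/* models HPIE and CCIE both being set */
	bool cmd_busy;		/* models ctrl->cmd_busy */
	bool cc_pending;	/* models an unconsumed Command Completed event */
};

/* Polling path: consumes the event but leaves cmd_busy untouched. */
static bool toy_poll_cmd(struct toy_ctrl *c)
{
	if (!c->cc_pending)
		return false;
	c->cc_pending = false;
	return true;
}

/* Interrupt path: only a pending event can clear cmd_busy. */
static bool toy_wait_irq(struct toy_ctrl *c)
{
	if (c->cc_pending) {
		c->cc_pending = false;
		c->cmd_busy = false;
	}
	return !c->cmd_busy;
}

/* Wait for an outstanding command; returns true if the wait timed out. */
static bool toy_wait_cmd(struct toy_ctrl *c)
{
	if (!c->cmd_busy)
		return false;
	return c->irq_enabled ? !toy_wait_irq(c) : !toy_poll_cmd(c);
}

/* Issue a command and wait for it; returns true if any wait timed out. */
static bool toy_write_cmd(struct toy_ctrl *c)
{
	bool timed_out = toy_wait_cmd(c);	/* previous command */

	c->cmd_busy = true;
	c->cc_pending = true;			/* assumed instant completion */
	return toy_wait_cmd(c) || timed_out;	/* this command */
}
```

In this model, a toy_write_cmd() with irq_enabled == false succeeds but
leaves cmd_busy set. If irq_enabled is then set to true, the next
toy_write_cmd() reports a timeout, because the stale cmd_busy sends it
down the interrupt path with no event left to consume.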

- Spencer