Re: PCI: hotplug: Erroneous removal of hotplug PCI devices

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
> In PCIe, the presence detect bit (PD) in the slot status register should 
> be a logical OR of in-band and out-of band presence. In-band presence is 
> the data link layer status. So one would expect that a link up event, 
> would be accompanied by a PD changed event with PD set. Not everyone 
> follows that.
> 
> I have a system here with the following order of events:
>   *   0 ms : Link up
>   * 400 ms : Presence detect up
> On the first event, the device is probed as expected, and on the second 
> event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.

It's normal that there's a lag between presence and link changes.

There's even hardware which flaps the link and presence bits a
couple of times before they settle.  Since commit 6c35a1ac3da63a7
("PCI: pciehp: Tolerate initially unstable link") we're quite lenient
towards such devices and tolerate link and presence flaps for up to
100 ms.  That's basically also the maximum delay we allow between link up
and presence detect up.

Theoretically you could say:  Let's wait after the link goes up
until presence also goes up.  But guess what, some vendors managed
to hardwire the presence detect flag to zero, so you'd wait forever
on those devices.  We have a workaround specifically for such hardware
since commit 80696f991424 ("PCI: pciehp: Tolerate Presence Detect
hardwired to zero").


> It's obvious that just relying on presence detect state is prone to race 
> conditions. However, if a device is replaced, we'd expect the data link 
> layer state to change as well. So I think the best way to proceed is to 
> skip the SURPRISE!!!_REMOVAL if the following are true:
>   * presence detect is set
>   * DLL changed is not set
>   * presence detect was not previously set

Unfortunately it's not that simple.  All we know is that presence has
changed and is currently up and the link is also up.  This could mean
that the device was swapped and it may just take a little longer until
the link flag also flaps.

A possible solution for your hardware would be to amend
pciehp_check_link_status() such that after the call to pci_bus_check_dev()
we wait until presence also goes up.  However it would have to be quirked
to your particular hardware, otherwise it would break devices which
hardwire presence detect to zero.  Or we'd have to define a limit of,
say, 1 second and if presence hasn't gone up within that period,
we'd just silently carry on.  This wouldn't break hardware which hardwires
presence detect to zero, but introduce a 1 sec delay for that hardware
to bring up the slot.

So I don't see a perfect solution.  What device are we talking about
anyway?  400 ms is a *long* time.

Thanks,

Lukas



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux