Re: Should a PCIe Link Down event set the PCI_DEV_DISCONNECTED bit?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Lukas,

+CC some more Dell folks.

On 07/27/2018 02:18 AM, Lukas Wunner wrote:
> On Thu, Jul 26, 2018 at 05:38:50PM -0500, Alex G. wrote:
>> I was under the impression that a DLLSC or PDSC would trigger the
>> PCI_DEV_DISCONNECTED bit to be set, blocking any further config access.
> 
> Only if the Presence Detect State bit in the Slot Status register is
> not set.
> 
> I think the idea was that if the card is still in the slot, its driver can
> be unbound orderly because the device is still accessible.  So there's no
> need to set PCI_DEV_DISCONNECTED.

This sounds counter to what I had intuited. I mentally associate 
DISCONNECTED with "the link is down". I don't understand the idea of a 
driver doing an orderly removal when the link is down. Device registers 
wouldn't be accessible in that case.

> However if there is a already a new card in the slot, PCI_DEV_DISCONNECTED
> will erroneously not be set.  Also, if card removal was triggered by the
> link going down but the card is still in the slot, PCI_DEV_DISCONNECTED
> will also erroneously not be set even though it's inaccessible.
> 
> The only situations when PCI_DEV_DISCONNECTED should not be set is if the
> card is being removed by sysfs or by the user pressing the Attention Button.
> Anything else is a surprise removal.  What we need to do is pass down a
> flag to pciehp_unconfigure_device() to indicate whether one of those two
> situations is at hand or not, and PCI_DEV_DISCONNECTED should be set
> depending on that flag.

That makes a little more sense. Is someone working on this?

> Are you testing Bjorn's pci/hotplug branch or something based on 4.18 or
> earlier?  There is a major event handling rework queued on that branch,
> so you may want to test with that.

Been using 4.18-rc6. I'll take a look at Bjorn's latest and greatest. 
Thank you for the heads-up.

>> In the latter case, does it not make sense to have a separate bit to say
>> "Don't touch this device"?
> 
> Hm, what would you need it for?

Idiotic firmware bugs and unrefined device removal mechanisms. For 
example, on removal, we shouldn't be touching MSI registers on a device 
whose link is down. Firmware-first (FFS) systems tend to blow up, crash, 
and burn when that happens.

Another example is NVME subsystem resets (NSSR). NSSR resets the PCIe 
link, and the PD state stays the same. Once teardown gets to masking the 
MSI-X bits, it sends an MMIO to the downed link.

I want to resolve such crashes by guarding teardown paths against 
touching registers on the disconnected device -- yes, I know it's a 
band-aid and not a cure.

Alex

> Thanks,
> 
> Lukas
> 




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux