From: Alex_Gagniuc@xxxxxxxxxxxx
> Sent: 31 July 2018 17:36
>
> On 07/31/2018 04:29 AM, Lukas Wunner wrote:
> > On Mon, Jul 30, 2018 at 09:38:04PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
> >> On 07/28/2018 01:31 PM, Lukas Wunner wrote:
> >>> On Fri, Jul 27, 2018 at 05:51:04PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
> >>>> I think PCI_DEV_DISCONNECTED is a documentation issue above all else.
> >>>> The history I was given is that drivers would take a very long time to
> >>>> tear down a device. Config space IO to an nonexistent device took a long
> >>>> while to time out. Performance was one motivation -- and was not
> >>>> documented.
> >>>
> >>> Often it is possible for the driver to detect surprise removal by
> >>> checking if mmio reads return "all ones". But in some cases that's
> >>> a valid value to read from mmio and then this approach won't work.
> >>> Also, checking every mmio read may negatively impact performance.
> >>
> >> A colleague and me beat that dead horse to the afterdeath. Consensus was
> >> that the return value is less reliable than a coin toss (of a two-heads
> >> coin).

Something cheap-ish to find out whether a -1 was caused by a card
removal might be sensible, especially if it can be done without a
config space read.

Clearly you can't check anything BEFORE doing the read.
And reading the pci-id from config space isn't entirely useful.
If the card has reset itself (and the link recovered) then you need
to read a BAR register and check that it is set up.

More interestingly, a read request that is inside the bridge's address
window but outside any BAR (fairly easy to set up if the target has a
large BAR and a small one) will also time out (and return -1) even
though there is no failure of the link.
If the target supports AER, the information about the failed cycle ends
up in the target's AER registers, even if the host bridge doesn't
support AER (or it is being ignored).
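To make the ordering of those checks concrete, here is a rough sketch
in plain C. It is not kernel code: the enum, the function names and the
config-read callback are all made up for illustration, and it assumes a
single 32-bit memory BAR whose programmed value the driver remembered.
It also still costs two config space reads, so it only makes sense
after a -1 has actually been seen.

```c
#include <stdint.h>

/* Standard PCI config space offsets. */
#define CFG_VENDOR_ID 0x00 /* low 16 bits: vendor id, 0xffff = no device */
#define CFG_BAR0      0x10 /* first base address register */

/* Hypothetical: possible explanations for an all-ones MMIO read. */
enum ones_cause {
    CAUSE_DEVICE_GONE,  /* config reads return all ones as well */
    CAUSE_DEVICE_RESET, /* device answers, but BAR0 lost its address */
    CAUSE_UNKNOWN       /* valid data, or a read that hit no BAR */
};

/* Hypothetical stand-in for a 32-bit config space read. */
typedef uint32_t (*cfg_read32_fn)(void *ctx, unsigned int offset);

static enum ones_cause classify_all_ones(cfg_read32_fn cfg_read32,
                                         void *ctx,
                                         uint32_t expected_bar0)
{
    uint32_t id = cfg_read32(ctx, CFG_VENDOR_ID);
    uint32_t bar0;

    /* An absent device returns all ones for its vendor id. */
    if ((id & 0xffffu) == 0xffffu)
        return CAUSE_DEVICE_GONE;

    /* Low 4 bits of a memory BAR are type flags, not address bits. */
    bar0 = cfg_read32(ctx, CFG_BAR0);
    if ((bar0 & ~0xfu) != (expected_bar0 & ~0xfu))
        return CAUSE_DEVICE_RESET;

    /* Link and BAR look fine; only the AER registers could say more. */
    return CAUSE_UNKNOWN;
}

/* Fake device so the sketch can be exercised without hardware. */
struct fake_dev {
    uint32_t id;   /* value returned for CFG_VENDOR_ID */
    uint32_t bar0; /* value returned for CFG_BAR0 */
};

static uint32_t fake_cfg_read32(void *ctx, unsigned int offset)
{
    const struct fake_dev *dev = ctx;

    switch (offset) {
    case CFG_VENDOR_ID:
        return dev->id;
    case CFG_BAR0:
        return dev->bar0;
    default:
        return 0xffffffffu;
    }
}
```

Note that CAUSE_UNKNOWN is exactly the case that config space alone
cannot resolve: the read may have returned real data, or it may have
landed inside the bridge window but outside any BAR.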
So it might be useful to be able to read the AER registers even when
no AER interrupt (or other notification) actually happens.

I've not managed to get Linux to pick up AER interrupts even on systems
where the hardware clearly supports them (at least on some slots).
I suspect the BIOS is carefully disabling them because of reports of
message logs being spammed with AER errors.

We also have one system (possibly a Dell 740) where any failure of a
PCIe link leads to an NMI and a kernel crash!
Not entirely useful in a server model that is supposed to have
resilience against various errors.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)