On 07/31/2018 04:29 AM, Lukas Wunner wrote:
> On Mon, Jul 30, 2018 at 09:38:04PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
>> On 07/28/2018 01:31 PM, Lukas Wunner wrote:
>>> On Fri, Jul 27, 2018 at 05:51:04PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
>>>> I think PCI_DEV_DISCONNECTED is a documentation issue above all else.
>>>> The history I was given is that drivers would take a very long time to
>>>> tear down a device. Config space IO to a nonexistent device took a long
>>>> while to time out. Performance was one motivation -- and was not
>>>> documented.
>>>
>>> Often it is possible for the driver to detect surprise removal by
>>> checking if mmio reads return "all ones". But in some cases that's
>>> a valid value to read from mmio and then this approach won't work.
>>> Also, checking every mmio read may negatively impact performance.
>>
>> A colleague and I beat that dead horse to the afterdeath. Consensus was
>> that the return value is less reliable than a coin toss (of a two-heads
>> coin).
>
> Can you elaborate why? Because the "official" stance is that checking
> every read where "all ones" is an invalid value is the proper way to
> detect unplugged devices. (Official as in, voiced by Greg KH and Bjorn.)
> In that sense, PCI_DEV_DISCONNECTED is sort of an unloved child.

All ones is not necessarily invalid. The bug surface is every single
config read. This approach doesn't even cover config writes -- config
writes are non-posted requests too in PCIe.

"Build it, and they will come". That means that drivers would expect
-ENODEV when a device is gone. If we have that infrastructure, more
drivers will start using it over time, and it's something that can also
be used by generic parts of the PCI code. That also means you need a
generic mechanism to determine a device is bye-bye, and that's what
PCI_DEV_DISCONNECTED gives you.

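To make the -ENODEV point concrete, here is a rough userspace sketch of
the idea. It is not kernel code: struct fake_dev, the fake_* helpers and
the hw_accesses counter are all made-up names for illustration. The only
assumption is a per-device "disconnected" flag that the hotplug path
sets, which the accessors check before touching hardware.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device; NOT the kernel's struct pci_dev. */
struct fake_dev {
	bool disconnected;     /* models the PCI_DEV_DISCONNECTED flag */
	uint32_t reg;          /* models a single config register */
	unsigned hw_accesses;  /* counts accesses that reached "hardware" */
};

/* Assumed to be called by the hotplug path once removal is detected. */
static void fake_dev_set_disconnected(struct fake_dev *dev)
{
	dev->disconnected = true;
}

/* Config read: fail fast with -ENODEV instead of waiting on a timeout. */
static int fake_read_config32(struct fake_dev *dev, uint32_t *val)
{
	assert(dev && val);
	if (dev->disconnected) {
		*val = UINT32_MAX;  /* all ones, as a dead device would return */
		return -ENODEV;
	}
	dev->hw_accesses++;
	*val = dev->reg;
	return 0;
}

/* Config write: covered by the same check, unlike the all-ones test. */
static int fake_write_config32(struct fake_dev *dev, uint32_t val)
{
	assert(dev);
	if (dev->disconnected)
		return -ENODEV;
	dev->hw_accesses++;
	dev->reg = val;
	return 0;
}
```

With this, a register that legitimately holds 0xFFFFFFFF reads back with
a return value of 0 while the device is present, and the caller only
sees -ENODEV once the flag is set. Writes get the same treatment, and
neither path touches the (absent) hardware after removal.
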
> See this thread:
> https://www.spinics.net/lists/linux-acpi/msg81445.html

The discussion is based on the wrong assumption that reads are
symmetrical wrt a device being present or not. However, completion
timeouts and unsupported requests blow that assumption right out of the
water. Best case scenario, you just waste a little more time waiting for
hardware IO. More common is to end up crashing the machine.

Greg's ideas work in a perfect world where PCI and PCIe are equivalent
at every level. And in that case, PCI_DEV_DISCONNECTED would have been
pure, 100% genuine Redmond bloatware. We wouldn't have gotten complaints
from Facebook and other industry players that it takes too damn long to
remove a device. We probably also wouldn't be seeing machines crash on
PCIe removal.

Fun fact: Before PCI_DEV_DISCONNECTED, you could physically swap a
device before the teardown path was done with the previous device.
Figuring out what problems that caused is left as an exercise for the
reader.

>>> FWIW, the below is what I had in mind (on top of Bjorn's pci/hotplug
>>> branch). Does this work for you?
>>
>> This, and another patch (you have been CC'd) solve my problem of
>> crashing during surprise removal. Thanks!
>
> Ok thanks, I submitted the patch this morning with your Tested-by.

Sweet. Thanks!

> Unfortunately I forgot to cc all your Dell colleagues, sorry.

They'll live. They already noticed it and sent me emails about it.

Alex

> Lukas
>