On Tue, Aug 21, 2018 at 08:04:46AM +1000, Benjamin Herrenschmidt wrote: > On Mon, 2018-08-20 at 18:02 -0400, Sinan Kaya wrote: > > On 8/20/2018 5:53 PM, Benjamin Herrenschmidt wrote: > > > > > Hotplug driver removes the devices on link down events and re-enumerates > > > > > on insertion. > > > > > > > > > > I am trying to separate fatal error handling from hotplug. > > > > > > > > I'll try to take a look. We can't always count on pciehp to do the > > > > removal when a removal occurs, though. The PCIe specification contains > > > > an implementation note that DPC may be used in place of hotplug surprise. > > > > > > Can't you use the presence detect to differenciate ? > > > > > > Also, I don't have the specs at hand right now, but does the hotplug > > > brigde have a way to "latch' the change in presence detect so we can > > > see if it has transitioned even if it's back on ? > > > > There is only presence detect change and link layer change. No actual > > state information. > > It does latch that it has changed tho right ? So if presence detect > hasn't changed, we can assume it's an error and not an unplug ? > > We could discriminate that way to reduce the risk of doing a recovery > without unbind on something that was actually removed and replaced. There might be some potential here. There actualy is a Slot Status Presence Detect State (PDS), but it isn't latched. The link and presence change (PDC) are latched. A potential problem may be a PDC event handling observes PDS set to the last known state if the remove + add happens fast enough, and then no action is taken in the current implementation. The DPC capability learned from such issues, and they will latch the status until acknowledged so those sorts of races are not possible. But anyway, maybe we can change pciehp to ignore PDS and always re-enumerate when PDC is observed, and that may very well simplify a lot of this.