On Wed, Mar 17, 2021 at 01:02:07PM -0700, Kuppuswamy, Sathyanarayanan wrote: > On 3/17/21 12:01 PM, Lukas Wunner wrote: > > If the events are ignored, the driver of the device in the hotplug slot > > is not unbound and rebound. So the driver must be able to cope with > > loss of TLPs during DPC recovery and it must be able to cope with > > whatever state the endpoint device is in after DPC recovery. > > Is this really safe? How does the nvme driver deal with it? > > During DPC recovery, in pcie_do_recovery() function, we use > report_frozen_detected() to notify all devices attached to the port > about the fatal error. After this notification, we expect all > affected devices to halt its IO transactions. > > Regarding state restoration, after successful recovery, we use > report_slot_reset() to notify about the slot/link reset. So device > drivers are expected to restore the device to working state after this > notification. Thanks a lot for the explanation. > I am not sure how pure firmware DPC recovery works. Is there a platform > which uses this combination? For firmware DPC model, spec does not clarify > following points. > > 1. Who will notify the affected device drivers to halt the IO transactions. > 2. Who is responsible to restore the state of the device after link reset. > > IMO, pure firmware DPC does not support seamless recovery. I think after it > clears the DPC trigger status, it might expect hotplug handler be responsible > for device recovery. > > I don't want to add fix to the code path that I don't understand. This is the > reason for extending this logic to pure firmware DPC case. I agree, let's just declare synchronization of pciehp with pure firmware DPC recovery as unsupported for now. I've just submitted a refined version of my patch to the list: https://lore.kernel.org/linux-pci/b70e19324bbdded90b728a5687aa78dc17c20306.1616921228.git.lukas@xxxxxxxxx/ If you could give this new version a whirl I'd be grateful. This version contains more code comments and kernel-doc. There's now a check in dpc_completed() whether the DPC Status register contains "all ones", which can happen when a DPC-capable hotplug port is hot-removed, i.e. for cascaded DPC-capable hotplug ports. I've also realized that the previous version was prone to races which are theoretical but should nonetheless be avoided: E.g., previously the DLLSC event was only removed from "events" if the link is still up after DPC recovery. However if DPC triggers and recovers multiple times in a row, the link may happen to be up but a new DLLSC event may have been picked up in "pending_events" which should be ignored. I've solved this by inverting the logic such that DLLSC is *always* removed from "events", and if the link is unexpectedly *down* after successful recovery, a DLLSC event is synthesized. Thanks, Lukas