Hi Bjorn and others,
Could you please share your thoughts on this?
Thanks,
Vidya Sagar
On 11/10/2023 10:31 PM, Vidya Sagar wrote:
There was a typo. Corrected it.
s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also
On 11/10/2023 6:30 PM, Vidya Sagar wrote:
Hi folks,
Here are the platform details.
- System with a Firmware First approach for both AER and DPC
- CPERs are sent to the OS for the AER events
- DPC is notified to the OS through EDR mechanism
- System doesn't have support for in-band PD and supports only OOB PD
where writing to a private register would set the PD state
- System has this design where PD state gets cleared whenever there is
a link down (without even writing to the private register)
In this system, if the root port has to experience a DPC because of
any right reasons (say SW trigger of DPC for time being), I'm
observing the following flow which is causing a race.
1. Since DPC brings the link down, there is a DLLSC and an MSI is sent
to the OS hence PCIe HP ISR is invoked.
2. ISR then takes stock of the Slot_Status register to see what all
events caused the MSI generation.
3. It sees both DLLSC and PDC bits getting set.
4. PDC is set because of the aforementioned hardware design where for
every link down, PD state gets cleared and since this is considered as
a change in the PD state, PDC also gets set.
5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt
thread/bottom half) and waits for the DPC_EDR to complete (because
DLLSC is one of the events)
6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the
firmware and that would then send an EDR event to the OS. Firmware
also sets the PD state to '1' before that, as the endpoint device is
still available.
7. Firmware programming of the private register to set the PD state
raises second interrupt and PCIe HP ISR takes stock of the events and
this time it is only the PDC and not DLLSC. ISR sends a wake to the
IST, but since there is an IST in progress already, nothing much
happens at this point.
8. Once the DPC is completed and link comes back up again, DPC
framework asks the endpoint drivers to handle it by calling the
'pci_error_handlers' callabacks registered by the endpoint device
drivers.
9. The PCIe HP IST (of the very first MSI) resumes after receiving the
completion from the DPC framework saying that DPC recovery is successful.
10. Since PDC (Presence Detect Change) bit is also set for the first
interrupt, IST attempts to remove the devices (as part of
pciehp_handle_presence_or_link_change())
At this point, there is a race between the device driver that is
trying to work with the device (through pci_error_handlers callback)
and the IST that is trying to remove the device.
To be fair to pciehp_handle_presence_or_link_change(), after removing
the devices, it checks for the link-up/PD being '1' and scans the
devices again if the device is still available. But unfortunately, IST
is deadlocked (with the device driver) while removing the devices
itself and won't go to the next step.
The rationale given in the form of a comment in
pciehp_handle_presence_or_link_change() for removing the devices first
(without checking PD/link-up) and adding them back after checking
link-up/PD is that, since there is a change in PD, the new link-up
need not be with the same old device rather it could be with a
different device. So, it justifies in removing the existing devices
and then scanning for new ones. But this is causing deadlock with the
already initiated DPC recovery process.
I see two ways to avoid the deadlock in this scenario.
1. When PCIe HP IST is looking at both DLLSC and PDC, why should
DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks
'PD Change' also (along with DLL) and declares that the DPC recovery
can't happen (as there could be a change of device) hence returning
DPC recovery failure, then, the device driver's pci_error_handlers
callbacks won't be called and there won't be any deadlock.
2. Check for the possibility of a common lock so that PCIe HP IST and
device driver's pci_error_handlers callbacks don't race.
Let me know your comments on this.
Thanks,
Vidya Sagar