My assumption was this sequence is something like this, where firmware
*can't* collect error status from devices below the Downstream Port
because DPC has been triggered and they are not accessible:

  - Hardware triggers DPC in a Downstream Port
  - Firmware fields error interrupt
  - Firmware captures Downstream Port error info (devices below are
    not accessible because of DPC)
  - Firmware sends EDR Notify to OS
  - OS brings Downstream Port out of DPC
  - OS collects error status from devices below Downstream Port
  - OS evaluates _OST
  - Firmware captures error status from devices below Downstream Port

MN: The above flow is correct. The error registers on the device are
sticky, so they should survive DPC (=hot reset).

But that doesn't explain why *firmware* could not clear the error
status of those devices after it captures it.

MN: Again you are right. There is no reason why firmware could not
clear error status of the device after the link has been brought out
of DPC, but after a lot of back and forth, it was decided that OS will
clear the error register on the device during DPC recovery=success
path. This was more than 4 years ago and I honestly don't remember why
we went this way.

I guess the flowchart *does* show firmware clearing the error status
in the "do not continue recovery" path.
Thank you,
Mahesh

-----Original Message-----
From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
Sent: Thursday, April 06, 2023 3:22 PM
To: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
Cc: Natu, Mahesh <mahesh.natu@xxxxxxxxx>; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; linux-pci@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH v2] PCI/EDR: Clear PCIe Device Status errors after EDR error recovery

On Thu, Apr 06, 2023 at 02:52:02PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 4/6/23 2:07 PM, Bjorn Helgaas wrote:
> > On Wed, Mar 15, 2023 at 04:54:49PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >> Commit 068c29a248b6 ("PCI/ERR: Clear PCIe Device Status errors only
> >> if OS owns AER") adds support to clear error status in the Device
> >> Status Register(DEVSTA) only if OS owns the AER support. But this
> >> change breaks the requirement of the EDR feature which requires OS
> >> to cleanup the error registers even if firmware owns the control of
> >> AER support.
> >>
> >> More details about this requirement can be found in PCIe Firmware
> >> specification v3.3, Table 4-6 Interpretation of the _OSC Control Field.
> >> If the OS supports the Error Disconnect Recover (EDR) feature and
> >> firmware sends the EDR event, then during the EDR recovery window,
> >> OS is responsible for the device error recovery and holds the
> >> ownership of the following error registers.
> >>
> >> • Device Status Register
> >> • Uncorrectable Error Status Register
> >> • Correctable Error Status Register
> >> • Root Error Status Register
> >> • RP PIO Status Register
> >>
> >> So call pcie_clear_device_status() in edr_handle_event() if the
> >> error recovery is successful.
> >>
> >> Reported-by: Tsaur Erwin <erwin.tsaur@xxxxxxxxx>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
> >> ---
> >>
> >> Changes since v1:
> >>  * Rebased on top of v6.3-rc1.
> >>  * Fixed a typo in pcie_clear_device_status().
> >>
> >>  drivers/pci/pcie/edr.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> >> index a6b9b479b97a..87734e4c3c20 100644
> >> --- a/drivers/pci/pcie/edr.c
> >> +++ b/drivers/pci/pcie/edr.c
> >> @@ -193,6 +193,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
> >>           */
> >>          if (estate == PCI_ERS_RESULT_RECOVERED) {
> >>                  pci_dbg(edev, "DPC port successfully recovered\n");
> >> +                pcie_clear_device_status(edev);
> >>                  acpi_send_edr_status(pdev, edev, EDR_OST_SUCCESS);
> >
> > The implementation note in PCI Firmware r3.3, sec 4.6.12, shows the
> > OS clearing error status *after* _OST is evaluated.
> >
> > On the other hand, the _OSC DPC control bit in table 4-6 says that
> > if the OS does not have DPC control, it can only write the Device
> > Status error bits between the EDR Notify and invoking _OST.
> >
> > Is one of those wrong, or am I missing something?
>
> Agree. It is conflicting info. IMO, the argument that the OS is
> allowed to clear the error registers during the EDR windows makes more
> sense. If OS is allowed to touch error registers owned by firmware
> after that window, it would lead to race conditions.
>
> Mahesh, let us know your comments. Maybe we need to fix this in the
> firmware specification.
My assumption was this sequence is something like this, where firmware
*can't* collect error status from devices below the Downstream Port
because DPC has been triggered and they are not accessible:

  - Hardware triggers DPC in a Downstream Port
  - Firmware fields error interrupt
  - Firmware captures Downstream Port error info (devices below are
    not accessible because of DPC)
  - Firmware sends EDR Notify to OS
  - OS brings Downstream Port out of DPC
  - OS collects error status from devices below Downstream Port
  - OS evaluates _OST
  - Firmware captures error status from devices below Downstream Port

But that doesn't explain why *firmware* could not clear the error
status of those devices after it captures it.

I guess the flowchart *does* show firmware clearing the error status
in the "do not continue recovery" path.
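
For illustration only, the ordering above and the table 4-6 window rule
(OS may write the Device Status error bits only between the EDR Notify
and evaluating _OST) can be sketched as a small standalone C model.
This is not kernel or firmware code: struct model_dev, enum edr_phase
and every function name below are hypothetical. Only the four Device
Status error bits and their write-1-to-clear behavior come from the
PCIe register layout. The model follows the table 4-6 reading rather
than the sec 4.6.12 implementation note, which is the open question in
this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device Status error bits; all four are RW1C (write 1 to clear). */
#define DEVSTA_CED   0x0001  /* Correctable Error Detected */
#define DEVSTA_NFED  0x0002  /* Non-Fatal Error Detected */
#define DEVSTA_FED   0x0004  /* Fatal Error Detected */
#define DEVSTA_URD   0x0008  /* Unsupported Request Detected */
#define DEVSTA_ERR_MASK (DEVSTA_CED | DEVSTA_NFED | DEVSTA_FED | DEVSTA_URD)

/* Hypothetical phases of the sequence discussed above. */
enum edr_phase {
	EDR_IDLE,           /* no error pending */
	EDR_DPC_TRIGGERED,  /* hardware triggered DPC, firmware handling it */
	EDR_WINDOW_OPEN,    /* EDR Notify sent, _OST not yet evaluated */
	EDR_WINDOW_CLOSED   /* _OST evaluated, firmware owns the registers */
};

/* Hypothetical model of one device's Device Status register. */
struct model_dev {
	uint16_t devsta;
	enum edr_phase phase;
};

/* RW1C write: every error bit written as 1 is cleared, others kept. */
static void devsta_write(struct model_dev *d, uint16_t val)
{
	d->devsta &= (uint16_t)~(val & DEVSTA_ERR_MASK);
}

static void hw_trigger_dpc(struct model_dev *d)
{
	d->phase = EDR_DPC_TRIGGERED;
}

/* Firmware opens the EDR window; only valid after a DPC trigger. */
static void fw_send_edr_notify(struct model_dev *d)
{
	assert(d->phase == EDR_DPC_TRIGGERED);
	d->phase = EDR_WINDOW_OPEN;
}

/* OS closes the EDR window by evaluating _OST. */
static void os_evaluate_ost(struct model_dev *d)
{
	assert(d->phase == EDR_WINDOW_OPEN);
	d->phase = EDR_WINDOW_CLOSED;
}

/*
 * OS-side clear: read the register and write the value back, which
 * clears whatever error bits were set. Refused outside the window.
 */
static bool os_clear_device_status(struct model_dev *d)
{
	if (d->phase != EDR_WINDOW_OPEN)
		return false;
	devsta_write(d, d->devsta);
	return true;
}
```

In this model, calling os_clear_device_status() before the Notify or
after _OST returns false and leaves the bits set; calling it inside
the window clears them, which corresponds to where the patch places
pcie_clear_device_status() relative to acpi_send_edr_status().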