Hi Bjorn, Thanks for the review. On 3/29/23 3:09 PM, Bjorn Helgaas wrote: > [+cc Jonathan, author of 068c29a248b6] > > On Wed, Mar 15, 2023 at 04:54:49PM -0700, Kuppuswamy Sathyanarayanan wrote: >> Commit 068c29a248b6 ("PCI/ERR: Clear PCIe Device Status errors only if >> OS owns AER") adds support to clear error status in the Device Status >> Register(DEVSTA) only if OS owns the AER support. But this change >> breaks the requirement of the EDR feature which requires OS to cleanup >> the error registers even if firmware owns the control of AER support. >> >> More details about this requirement can be found in PCIe Firmware >> specification v3.3, Table 4-6 Interpretation of the _OSC Control Field. >> If the OS supports the Error Disconnect Recover (EDR) feature and >> firmware sends the EDR event, then during the EDR recovery window, OS >> is responsible for the device error recovery and holds the ownership of >> the following error registers. >> >> • Device Status Register >> • Uncorrectable Error Status Register >> • Correctable Error Status Register >> • Root Error Status Register >> • RP PIO Status Register >> >> So call pcie_clear_device_status() in edr_handle_event() if the error >> recovery is successful. > > IIUC, after ac1c8e35a326 ("PCI/DPC: Add Error Disconnect Recover (EDR) > support") appeared in v5.7-rc1, DEVSTA was always cleared in this path: > > edr_handle_event > pcie_do_recovery > pcie_clear_device_status > > After 068c29a248b6 ("PCI/ERR: Clear PCIe Device Status errors only if > OS owns AER") appeared in v5.9-rc1, we only clear DEVSTA if the OS > owns the AER Capability: > > edr_handle_event > pcie_do_recovery > if (pcie_aer_is_native(dev)) # <-- new test > pcie_clear_device_status > > So in the case where the OS does *not* own AER, and it receives an EDR > notification, DEVSTA is not cleared when it should be. Right? Correct. > > I assume we should have a Fixes: tag here, since this patch should be > backported to every kernel that contains 068c29a248b6. Possibly even > a stable tag, although it's arguable whether it's "critical" per > Documentation/process/stable-kernel-rules.rst. Yes. But this error is only reproducible in the EDR use case. So I am not sure whether it can be considered a critical fix. Fixes: 068c29a248b6 ("PCI/ERR: Clear PCIe Device Status errors only if OS owns AER") > >> Reported-by: Tsaur Erwin <erwin.tsaur@xxxxxxxxx> > > I assume this report was internal, and there's no mailing list post or > bugzilla issue URL we can include here? Yes. > >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> >> --- >> >> Changes since v1: >> * Rebased on top of v6.3-rc1. >> * Fixed a typo in pcie_clear_device_status(). >> >> drivers/pci/pcie/edr.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c >> index a6b9b479b97a..87734e4c3c20 100644 >> --- a/drivers/pci/pcie/edr.c >> +++ b/drivers/pci/pcie/edr.c >> @@ -193,6 +193,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data) >> */ >> if (estate == PCI_ERS_RESULT_RECOVERED) { >> pci_dbg(edev, "DPC port successfully recovered\n"); >> + pcie_clear_device_status(edev); > > It's a little weird to work around a change inside pcie_do_recovery() > by clearing it here, and that means we clear it twice in the AER > native case, but I don't see any simpler way to do this, so this seems > fine as the fix for the current issue. In AER native case, edr_handle_event() will never be triggered. So it won't be cleared twice. Other way is to add a new parameter to pcie_do_recovery(..., edr) and use it to conditionally call pcie_clear_device_status(). But I think current way is less complex. edr_handle_event pcie_do_recovery(..., edr=1) if (pcie_aer_is_native(dev) || edr) pcie_clear_device_status > > Question though: in the AER native case, pcie_do_recovery() calls > both: > > pcie_clear_device_status() and > pci_aer_clear_nonfatal_status() > > In this patch, you only call pcie_clear_device_status(). Do you care > about pci_aer_clear_nonfatal_status(), too? Yes, we care about it. Since we call dpc_process_error() in EDR handler, it will eventually clear error status via pci_aer_clear_nonfatal_status() and pci_aer_clear_fatal_status() within dpc_process_error(). > > The overall design for clearing status has gotten pretty complicated > as we've added error handling methods (firmware-first, DPC, EDR), and > there are so many different places and cases that it's hard to be sure > we do them all correctly. > > I don't really know how to clean this up, so I'm just attaching my > notes about the current state: Good summary! I can see a lot of overlap in clearing PCI_ERR_UNCOR_STATUS and PCI_EXP_DEVSTA. > > - AER native handling: > > handle_error_source > if (info->severity == AER_CORRECTABLE) > clear PCI_ERR_COR_STATUS <-- > if (pcie_aer_is_native(dev)) > pdrv->err_handler->cor_error_detected() > pcie_clear_device_status > clear PCI_EXP_DEVSTA <-- > else > pcie_do_recovery > pcie_clear_device_status > clear PCI_EXP_DEVSTA <-- > pci_aer_clear_nonfatal_status > clear PCI_ERR_UNCOR_STATUS <-- > > - Firmware-first handling: status is cleared by firmware before > event is reported to OS via HEST > > - DPC native handling: > > dpc_handler > dpc_process_error > if (rp_extensions) > dpc_process_rp_pio_error > clear PCI_EXP_DPC_RP_PIO_STATUS <-- > else if (...) > pci_aer_clear_nonfatal_status > clear PCI_ERR_UNCOR_STATUS <-- > pci_aer_clear_fatal_status > clear PCI_ERR_UNCOR_STATUS <-- > pcie_do_recovery > if (AER native) > pcie_clear_device_status > clear PCI_EXP_DEVSTA <-- > pci_aer_clear_nonfatal_status > clear PCI_ERR_UNCOR_STATUS <-- > > - EDR event handling: > > edr_handle_event > dpc_process_error > if (rp_extensions) > dpc_process_rp_pio_error > clear PCI_EXP_DPC_RP_PIO_STATUS <-- > else if (...) > pci_aer_clear_nonfatal_status > clear PCI_ERR_UNCOR_STATUS <-- > pci_aer_clear_fatal_status > clear PCI_ERR_UNCOR_STATUS <-- > pci_aer_raw_clear_status > clear PCI_ERR_ROOT_STATUS <-- > clear PCI_ERR_COR_STATUS <-- > clear PCI_ERR_UNCOR_STATUS <-- > pcie_do_recovery > if (AER native) > pcie_clear_device_status > clear PCI_EXP_DEVSTA <-- > pci_aer_clear_nonfatal_status > clear PCI_ERR_UNCOR_STATUS <-- > if (PCI_ERS_RESULT_RECOVERED) > pcie_clear_device_status > clear PCI_EXP_DEVSTA <-- > >> acpi_send_edr_status(pdev, edev, EDR_OST_SUCCESS); >> } else { >> pci_dbg(edev, "DPC port recovery failed\n"); >> -- >> 2.34.1 >> -- Sathyanarayanan Kuppuswamy Linux Kernel Developer