[+cc Austin, _OSC expert] On Mon, Jun 22, 2020 at 07:35:23PM +0800, Jonathan Cameron wrote: > pci_aer_clear_device_status() currently resets the device status > (PCI_EXP_DEVSTA) on the downstream port above a device, or the port itself > if the port is the reported AER error source. This happens even when error > handling is firmware first. > > Our interpretation is that firmware first handling means that the firmware > will deal with clearing all relevant error reporting registers > including this one. IMO "firmware-first" is meaningless to the kernel. I see the bit defined in the ACPI HEST records (ACPI v6.3, sec 18.3.2.4), but there is no indication of anything the OS needs to *do* with it. It does not influence the result of pcie_aer_is_native(). So I don't want to mention it in the subject or commit log. But I think what the _OSC negotiation for AER ownership is relevant, and that's what your patch tests, so I think this is the right thing to do. So I applied this as below to pci/error for v5.8, thanks a lot! Oh, I also propose a preliminary patch (posted and cc'd to you) to rename pci_aer_clear_device_status(): https://lore.kernel.org/r/20200717195619.766662-1-helgaas@xxxxxxxxxx commit d6c8d24e3d5d ("PCI/ERR: Clear PCIe Device Status errors only if OS owns AER") Author: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> Date: Mon Jun 22 19:35:23 2020 +0800 PCI/ERR: Clear PCIe Device Status errors only if OS owns AER pcie_clear_device_status() resets the error bits in the PCIe Device Status Register (PCI_EXP_DEVSTA). Previously we did this unconditionally, but on ACPI systems, the _OSC AER bit negotiates control of the AER capability. Per sec 4.5.1 of the System Firmware Intermediary _OSC and DPC Updates ECN [1], this bit also covers other error enable/status bits including the following: Correctable Error Reporting Enable Non-Fatal Error Reporting Enable Fatal Error Reporting Enable Unsupported Request Reporting Enable These bits are all in the PCIe Device Control register (the ECN omitted "Reporting", but I think that's a typo), so by implication the _OSC AER bit also applies to the error status bits in the PCIe Device Status register: Correctable Error Detected Non-Fatal Error Detected Fatal Error Detected Unsupported Request Detected Clear the PCIe Device Status error bits only when the OS controls the AER capability and related error enable/status bits. If platform firmware controls the AER capability, firmware is responsible for clearing these bits. One call path leading here is: ghes_do_proc ghes_handle_aer aer_recover_queue schedule_work(&aer_recover_work) ... aer_recover_work_func pcie_do_recovery pcie_clear_device_status [1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24, 2020, affecting PCI Firmware Specification, Rev. 3.2 https://members.pcisig.com/wg/PCI-SIG/document/14076 [bhelgaas: commit log] Link: https://lore.kernel.org/r/20200622113523.891666-1-Jonathan.Cameron@xxxxxxxxxx Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index d3ea667c8520..34bfea5c52b3 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -245,6 +245,9 @@ void pcie_clear_device_status(struct pci_dev *dev) { u16 sta; + if (!pcie_aer_is_native(dev)) + return; + pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta); pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta); }