On 27 Jul 2020, at 7:04, Jonathan Cameron wrote:
On Fri, 24 Jul 2020 10:22:18 -0700
Sean V Kelley <sean.v.kelley@xxxxxxxxx> wrote:
From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Currently the kernel does not handle AER errors for Root Complex
Integrated End Points (RCiEPs) [0]. These devices sit on a root bus
within the Root Complex (RC). AER handling is performed by a Root
Complex Event Collector (RCEC) [1], which is effectively a type of
RCiEP on the same root bus.
For an RCEC (technically not a Bridge), error messages "received" from
associated RCiEPs must be enabled for "transmission" in order to cause
a System Error via the Root Control register or (when the Advanced
Error Reporting Capability is present) reporting via the Root Error
Command register and logging in the Root Error Status register and
Error Source Identification register.
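As a rough illustration of the enable step on the Root Error Command
register, here is a minimal userspace sketch. The bit values mirror the
PCI_ERR_ROOT_CMD_* definitions in include/uapi/linux/pci_regs.h. The
helper names (rcec_enable_error_reporting() and friends) and the plain
uint32_t standing in for the register are assumptions made for this
example and are not part of the patch. Real kernel code would do a
config space read-modify-write on the AER capability instead.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as PCI_ERR_ROOT_CMD_* in include/uapi/linux/pci_regs.h. */
#define ROOT_CMD_COR_EN      0x00000001u /* correctable error reporting enable */
#define ROOT_CMD_NONFATAL_EN 0x00000002u /* non-fatal error reporting enable */
#define ROOT_CMD_FATAL_EN    0x00000004u /* fatal error reporting enable */

#define ROOT_CMD_EN_MASK \
	(ROOT_CMD_COR_EN | ROOT_CMD_NONFATAL_EN | ROOT_CMD_FATAL_EN)

/*
 * Hypothetical helper: given the current Root Error Command value,
 * return the value to write back so that all three severities are
 * reported. Bits outside the enable mask are preserved.
 */
uint32_t rcec_enable_error_reporting(uint32_t root_cmd)
{
	return root_cmd | ROOT_CMD_EN_MASK;
}

/* Hypothetical helper: non-zero if every severity is already enabled. */
int rcec_reporting_enabled(uint32_t root_cmd)
{
	return (root_cmd & ROOT_CMD_EN_MASK) == ROOT_CMD_EN_MASK;
}

/* Usage: starting from a cleared register, all three enables end up set. */
void rcec_enable_demo(void)
{
	uint32_t cmd = rcec_enable_error_reporting(0);

	assert(rcec_reporting_enabled(cmd));
}
```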
In addition to the defined OS level handling of the reset flow for the
associated RCiEPs of an RCEC, it is possible to also have a firmware
first model. In that case there is no need to take any actions on the
RCEC because the firmware is responsible for them. This is true where
APEI [2] is used to report the AER errors via a GHES[v2] HEST entry [3]
and relevant AER CPER record [4] and Firmware First handling is in use.
We effectively end up with two different types of discovery for
purposes of handling AER errors:
1) Normal bus walk - we pass the downstream port above a bus to which
the device is attached and it walks everything below that point.
2) An RCiEP with no visible association with an RCEC as there is no
need to walk devices. In that case, the flow is to just call the
callbacks for the actual device.
A new walk function, similar to pci_walk_bus(), is provided that takes
a pci_dev instead of a bus. If that dev corresponds to a downstream
port, it will walk the subordinate bus of that downstream port. If it
does not, the function is called on that device alone.
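The two discovery cases and the walk described above can be sketched
in userspace roughly as follows. This is only a model under stated
assumptions: struct mock_dev and struct mock_bus are hypothetical
stand-ins for struct pci_dev and struct pci_bus, every mock_* and
demo_* name is invented for the example, and the port type values are
copied from PCI_EXP_TYPE_* in pci_regs.h. It is not the code from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Port type values as in PCI_EXP_TYPE_* (include/uapi/linux/pci_regs.h). */
enum {
	TYPE_ENDPOINT   = 0x0,
	TYPE_ROOT_PORT  = 0x4,
	TYPE_DOWNSTREAM = 0x6,
	TYPE_RC_END     = 0x9,	/* RCiEP */
};

struct mock_dev;

/* Hypothetical stand-in for struct pci_bus: a flat array of devices. */
struct mock_bus {
	struct mock_dev **devices;
	size_t ndev;
};

/* Hypothetical stand-in for struct pci_dev. */
struct mock_dev {
	int type;
	struct mock_bus *subordinate;	/* bus below a port, else NULL */
};

typedef int (*walk_cb)(struct mock_dev *dev, void *data);

/* Visit every device on the bus and below it; stop if cb returns non-zero. */
int mock_walk_bus(struct mock_bus *bus, walk_cb cb, void *data)
{
	for (size_t i = 0; i < bus->ndev; i++) {
		struct mock_dev *d = bus->devices[i];
		int ret = cb(d, data);

		if (ret)
			return ret;
		if (d->subordinate) {
			ret = mock_walk_bus(d->subordinate, cb, data);
			if (ret)
				return ret;
		}
	}
	return 0;
}

/*
 * Case 1: dev is a downstream port with a bus below it, so walk that bus.
 * Case 2: anything else (e.g. an RCiEP), so call cb on that device alone.
 */
int mock_walk_dev_affected(struct mock_dev *dev, walk_cb cb, void *data)
{
	assert(dev != NULL);

	if ((dev->type == TYPE_ROOT_PORT || dev->type == TYPE_DOWNSTREAM) &&
	    dev->subordinate)
		return mock_walk_bus(dev->subordinate, cb, data);

	return cb(dev, data);
}

/* Example callback: count the devices visited. */
int count_cb(struct mock_dev *dev, void *data)
{
	(void)dev;
	++*(int *)data;
	return 0;
}

/* Usage: a root port with two endpoints below it; both are visited. */
int demo_walk_from_port(void)
{
	struct mock_dev ep1 = { TYPE_ENDPOINT, NULL };
	struct mock_dev ep2 = { TYPE_ENDPOINT, NULL };
	struct mock_dev *below[] = { &ep1, &ep2 };
	struct mock_bus bus = { below, 2 };
	struct mock_dev port = { TYPE_ROOT_PORT, &bus };
	int visited = 0;

	mock_walk_dev_affected(&port, count_cb, &visited);
	return visited;
}

/* Usage: an RCiEP has nothing below it; only the device itself is visited. */
int demo_walk_from_rciep(void)
{
	struct mock_dev rciep = { TYPE_RC_END, NULL };
	int visited = 0;

	mock_walk_dev_affected(&rciep, count_cb, &visited);
	return visited;
}
```

Note that in this model the port itself is not passed to the callback
in case 1; only the devices on and below its subordinate bus are.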
[0] PCI Express Base Specification 5.0-1, 1.3.2.3 Root Complex
    Integrated Endpoint Rules
[1] PCI Express Base Specification 5.0-1, 6.2 Error Signalling and
    Logging
[2] ACPI Specification 6.3, Chapter 18 ACPI Platform Error Interface
    (APEI)
[3] ACPI Specification 6.3, 18.2.3.7 Generic Hardware Error Source
[4] UEFI Specification 2.8, N.2.7 PCI Express Error Section
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Signed-off-by: Sean V Kelley <sean.v.kelley@xxxxxxxxx>
---
...
 	pci_dbg(dev, "broadcast resume message\n");
-	pci_walk_bus(bus, report_resume, &status);
+	pci_walk_dev_affected(dev, report_resume, &status);
-	pci_aer_clear_device_status(dev);
-	pci_aer_clear_nonfatal_status(dev);
This code has changed a little in Bjorn's pci/next branch, so please
rebase on that before v2.
Will ensure rebase includes pci/next.
Thanks,
Sean
+	if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+	     pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
+	     pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)) {
+		pci_aer_clear_device_status(dev);
+		pci_aer_clear_nonfatal_status(dev);
+	}
 	pci_info(dev, "device recovery successful\n");
 	return status;
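
For reference, the type check in the hunk above can be read as a small
predicate. The sketch below is a userspace model only: the name
aer_should_clear_port_status() is hypothetical and does not appear in
the patch, and the constants are copies of the PCI_EXP_TYPE_* values
from pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as PCI_EXP_TYPE_* in include/uapi/linux/pci_regs.h. */
#define PCIE_TYPE_ENDPOINT   0x0
#define PCIE_TYPE_ROOT_PORT  0x4
#define PCIE_TYPE_UPSTREAM   0x5
#define PCIE_TYPE_DOWNSTREAM 0x6
#define PCIE_TYPE_RC_END     0x9	/* RCiEP */
#define PCIE_TYPE_RC_EC      0xa	/* RCEC */

/*
 * Hypothetical helper: true for the device types for which the hunk
 * clears the device status and the non-fatal AER status after recovery.
 */
bool aer_should_clear_port_status(int pcie_type)
{
	switch (pcie_type) {
	case PCIE_TYPE_ROOT_PORT:
	case PCIE_TYPE_DOWNSTREAM:
	case PCIE_TYPE_RC_EC:
		return true;
	default:
		return false;
	}
}

/* Usage: an RCEC qualifies, an RCiEP on its own does not. */
void aer_clear_check_demo(void)
{
	assert(aer_should_clear_port_status(PCIE_TYPE_RC_EC));
	assert(!aer_should_clear_port_status(PCIE_TYPE_RC_END));
}
```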