Re: [PATCH v2] PCI/AER: Add support for reset of RCiEPs for APEI/Firmware first reporting only

"Sean V Kelley" <sean.v.kelley@xxxxxxxxxxxxxxx> · Tue, 07 Jul 2020 16:40:58 -0700

On 7 Jul 2020, at 9:56, Sean V Kelley wrote:

On 3 Jul 2020, at 1:23, Jonathan Cameron wrote:

On Thu, 2 Jul 2020 11:06:26 -0700
Sean V Kelley <sean.v.kelley@xxxxxxxxxxxxxxx> wrote:

On 26 Jun 2020, at 11:41, Jonathan Cameron wrote:

On Fri, 26 Jun 2020 09:29:34 -0700
Sean V Kelley <sean.v.kelley@xxxxxxxxxxxxxxx> wrote:

Hi,
Hi,

Thanks for taking a look.


On 22 Jun 2020, at 4:44, Jonathan Cameron wrote:

Was previously: PCI/AER: Add partial initial supprot for
RCiEPs
using
RCEC or
firmware first.

Currently the kernel does not handle AER errors for Root
Complex
integrated
End Points (RCiEPs)[0].  These devices sit on a root bus
within
the
Root Complex
(RC).  AER handling is performed by a Root Complex Event
Collector
(RCEC) [1]
which is a effectively a type of RCiEP on the same root bus.

This code will only perform the correct reset flow for the
case
where
there
is no need to take any actions on the RCEC because the
firmware is
responsible for them.   This is true where APEI [2] is used
to
report
the AER
errors via a GHES[v2] HEST entry [3] and relevant AER CPER
record
[4]
and Firmware
First handling is in use.

Right, in the case of the RCEC one identifies the RCiEPs by
the
RCiEP
bitmap as a part of the RCEC Associated Endpoint Extended
Capabilities.
This ‘search’ so to speak would make use also of the RCEC
Associated
Bus Numbers Register to associate the devices with an RCEC when
not
on
the same bus.

Ah. I'm afraid my access to recent specs is a bit limited at the
moment.
I do have a draft 5.0 spec which has that in though so I now see
what
you mean.

Was introduced in Root Complex Event Collector Endpoint
Association
Extended
Capability version 2 in PCIe 5.0 I think.


Correct.

As there is no current RCEC driver support, it should not be
possible
to get
to this code via any routes other than the one above. Hence
appropriate RCEC
handling can be added when the RCEC driver support is ready.
The error handling is different from a normal PCIe End Point
because:

1) There is no downstream port above an RCiEP as these
devices sit
on
a root
   bus.

2) In general, it makes little sense to reset other devices
on on
the
same
   root bus.  For error handling outside the of the root
complex
(RC)
an AER
   error will indicate that all the topology below the
physical
link,
which
   the error is related to, will need to be reset as they
share a
common
   path to the host.  For an RCiEP there is no such defined
shared
path
   relationship with other elements on the root bus.

A new walk function, similar to pci_bus_walk is provided
that
takes
a
pci_dev
instead of a bus.  If that dev corresponds to a downstream
port it
will walk
the subordinate bus of that downstream port.  If the dev does
not
then
it
will call the function on that device alone.   This function
allows
us
to
avoid adding special cases to the majority of the error
handling.

Then in that case the callback could add the additional checks
specific
to identifying the associated RCiEPs.

I am afraid I don't follow what you mean here.  Could you give
more
info?

Sure, a given RCEC can be associated with multiple RCiEPs.  As a
part
of
the Extended Association Cap it is possible to obtain a bitmap of
the
RCiEP device ids on the same bus number as the RCEC device itself.
(5.0-1.0 sec 7.9.10.2).  With a Cap version of 2h or higher, it is
also
possible to get an additional range of bus numbers containing
RCiEPs
also associated with this RCEC.

So I’m wondering if this function could be used in which passing a
dev, in this case the RCEC, triggers the call back which makes use
of
the RCiEP bitmap and associated bus ranges to return all identified
devices in use cases such as in AER for finding sources, etc.

Ah understood.

If we do this we effectively end up with 3 different types of walk
and
the meaning of the walk function gets more complex again.

1) Normal bus walk - we pass the downstream port above a bus to which
the device is attached and it walks everthing below that point.

2) Case I care about RCiEP with no visible association with an RCEC
as
I don't need to walk devices.  In that case just calls the callbacks
for
the actual device.

3) Pass in RCiEP with RCEC asociated with it (or do a dance at the
caller
to pass in the RCEC itself). Need to walk the devices that the RCEC
is
handling errors for.  For handling, I'm not all the calls will be
generally
applicable to other devices associated with the RCEC as some only
make
sense if there is an actual PCIe bus involved and hence we need to
reset
other devices on that bus.  For RCEC I don't think there is an
particular
reason to assume an AER error reported at one RCiEP will have any
impact
on other devices associated with the particular RCEC.
I've not found anything in the spec addressing this question but
perhaps
I've missed something?

Correct.  There should be no impact to the RCEC or its associated
RCiEPs which may not happen to reside on the same bus as the collector 
at all.

However, if the RCEC doesn't support multiple error records, you may
need
to walk the bus to identify multiple simultaneous issues, very
carefully
avoiding (or least minimizing) race conditions.

The alternative is to have a separate walk for RCECs that loops
through
the bitmap / ranges (if supported) triggering the callback for each
device found.

I’ve been testing the Associated Endpoint Bitmap and Bus Range
handling and using my pciutils patches to help to confirm some of the
association. This overlaps with my CXL work and CXL 1.1 based RCiEPs
are good test cards:

(Decode via:
https://lore.kernel.org/linux-pci/20200624223940.240463-1-sean.v.kelley@xxxxxxxxxxxxxxx/
 )

Test card at 6b:00.0

Capabilities: [e00 v1] Designated Vendor-Specific: Vendor=1e98 ID=0000
Rev=0 Len=56: CXL
CXLCap: Cache- IO+ Mem+ Mem HW Init+ HDMCount 1 Viral-
CXLCtl: Cache- IO+ Mem- Cache SF Cov 0 Cache SF Gran 0 Cache Clean-
Viral-
CXLSta: Viral-
Capabilities: [e38 v1] Device Serial Number 30-91-11-78-10-00-00-00

RCEC assocated to RCiEP at 6b while residing at 6a:

Capabilities: [160 v2] Root Complex Event Collector Endpoint
Association
RCiEPBitmap: 00000000 [none]
AssociatedBusNumbers: 6b-6b
Kernel driver in use: pcieport

with dmesg:

[ 10.502543] pcieport 0000:6a:00.4: AER: enabled with IRQ 34

The trick is the walk, which is not compact.  Currently working on
error injection to test:

void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *,
void *),
+                   void *userdata)
+{
+       u32 pos, bitmap, hdr, busn;
+       u8 ver, nextbusn, lastbusn;
+       unsigned int dev, fn, bnr;
+       struct pci_bus *pbus=NULL;
+       struct pci_dev *pdev;
+       int retval;
+
+       pos = pci_find_ext_capability(rcec, PCI_EXT_CAP_ID_RCEC);
+       if (!pos)
+               return;
+
+       pci_read_config_dword(rcec, pos + PCI_RCEC_RCIEP_BITMAP,
&bitmap);
+
+       for (dev = 0; dev < 32; dev++) {
+               if (!(bitmap & (1 << dev)))
+                       continue;
+
+               for (fn = 0; fn < 8; fn++) {
+                       pdev =
pci_get_domain_bus_and_slot(pci_domain_nr(rcec->bus),
+                                                          rcec->bus-
number,
+                                                          
PCI_DEVFN(de
v, fn));
+                       if (!pdev)
+                               continue;
+
+                       retval = cb(pdev, userdata);
+                       if (retval)
+                               return;
+               }
+       }

Then continuing in the same function above, I need to also consider 
the
case for the Bus ranges (still wip, not tested):

+       pci_read_config_dword(rcec, pos, &hdr);
+       ver = PCI_RCEC_EP_CAP_VER(hdr);
+       if (ver < PCI_RCEC_BUSN_REG_VER)
+               return;
+
+       pci_read_config_dword(rcec, pos + PCI_RCEC_BUSN, &busn);
+       nextbusn = PCI_RCEC_BUSN_NEXT(busn);
+       lastbusn = PCI_RCEC_BUSN_LAST(busn);
+
+       if ((nextbusn == 0xff) && (lastbusn == 0x00))
+               return;
+
+       for (bnr = nextbusn; bnr < (lastbusn + 1); bnr++) {
+               pbus = pci_find_bus(pci_domain_nr(rcec->bus), bnr);
+               if (pbus) {
+                       /* find RCiEP devices on the given bus */
+                       for (dev = 0; dev < 32; dev++) {
+ etc...
+                       }
+               }
+       }
+}

Currently this lives in aer.c and it's large enough that I wonder if
due to the specifity of the assoicated spec requirments if that should
be fine?

Agreed. We would end up with the same splitting of handling paths
that
wasn't liked in my v1 patch.   Perhaps we need  single
pci_walk_aer_affected
function with a pile of documentation for what it is actually doing?

Perhaps this would be a good start?  I tend to agree.  I can also
submit more of the patches as RFC for further comment.

Thinking further on the matter and experimenting, I think a 
pci_walk_aer_device, or however we choose to call it, should be possible 
and cleaner when you consider the three cases.  The third case is what I 
illustrated pasted in the previous reply.  PME will need to likewise do 
a walk, in the case of RCEC.  However, the implementation will need to 
be different though it could share some common code for the third case.

Sean


Even then we may need to have a parameter to indicate a particular
callback
should be restricted to devices that share a 'real bus' or not.

In my case 'walk' through the spec options of either just bitmap or if
of sufficient version (2h), I walk through the bus ranges, calling the
callback at each encounter.


I guess the proof will as ever be in what the code looks like.

Agreed.


Hmm. It increasingly feels like we may need to have a go at drawing
together
some coherent documentation for the different ways of handling
AER errors and specifications / assumptions for each.  Would be
'interesting'
to do given I'm fairly sure very few people actually understands all
the options and
nasty corner cases!

+CC Lorenzo and James who may also be interested in this topic in
general.

Let's do.

Best regards,

Sean


Thanks,

Jonathan


Thanks,

Sean

Open questions:

1. Are we better protecting against link reset for an RCiEP
in
here
or
   should we put the check in the link reset functions?

2. If we were to get a stupid firmware record with the
relevant
reset
flag
   set to trigger a link reset, what is the correct
response?  For
now
I
   try to report that we haven't done anything and print a
warning.

3. Naming of pci_walk_below_dev is rather unsatisfying. Any
better
ideas?

4. pci_walk_below_dev is perhaps not of general utility.
Shall I
make
it local
   in err.c?  If not would a precursor patch for that be
preferred?’

It depends.  Is it intended as a drop in replacement where
needed
for
pci_walk_bus()? So in that case you are now passing the dev
structure
and do the check for subordinate or is it intended as being
specific
to
say RCEC? With AER, one could either first check for RC_EC type
before
using this one.  Or one could just drop in replace (passing the
dev
structure instead) and the call back performs the RCEC specific
checks
when a device is encountered.

If it is useful in aer.c that's great.   Just seemed such a weird
beast
I wasn't sure it would be of use anywhere else.

Testing has been performed via error injection on a QEMU
platform
as
that lets
me create a wide range of topologies and report errors at
any
chosen
location.
Currently I have no plans to upstream this injection support,
but
am
happy to
share if useful to others.

I’m experimenting with it in my RCEC code in AER and will give
you
additional feedback.

Great, thanks

Jonathan

Thanks,

Sean


[0] ACPI PCI Express Base Specification 4.0 1.3.2.3 Root
Complex
Integrated
    Endpoint Rules.
[1] ACPI PCI Express Base Specification 4.0 6.2 Error
Signalling
and
Logging
[2] ACPI Specification 6.3 Chapter 18 ACPI Platform Error
Interface
(APEI)
[3] ACPI Sepcification 6.3 18.2.3.7 Generic Hardware Error
Source
[4] UEFI Specification 2.8, N.2.7 PCI Express Error Section

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
---
Changes since v1:
* Separated from the largely unrelated fix so the two can
move
forwards separately.
* Instead of separate path for RCiEP handling use the method
suggested
by Bjorn
  and Sathyanarayanan with an adjusted pci_bus_walk.

Thanks all for reviews of V1.

 drivers/pci/bus.c      | 28 ++++++++++++++++++++++++++++
 drivers/pci/pcie/err.c | 29 +++++++++++++++++++----------
 include/linux/pci.h    |  2 ++
 3 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 8e40b3e6da77..7cbe1ed2db3d 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -411,6 +411,34 @@ void pci_walk_bus(struct pci_bus *top,
int
(*cb)(struct pci_dev *, void *),
 }
 EXPORT_SYMBOL_GPL(pci_walk_bus);

+/** pci_walk_below_dev - walk devices below (or on) another
device
+ *  @dev      device for which we should walk below,
include
device
when not a port.
+ *  @cb       callback to be called for each device found
+ *  @userdata arbitrary pointer to be passed to callback.
+ *
+ *  If the device provided is a port,
+ *  walk the subordinate bus, including any bridged devices
+ *  on buses under this bus.  Call the provided callback
+ *  on each device found.
+ *
+ *  If the device provided hs no subordinate bus, call the
provided
+ *  callback on the device itself.
+ *
+ */
+void pci_walk_below_dev(struct pci_dev *dev, int
(*cb)(struct
pci_dev
*, void *),
+			void *userdata)
+{
+	struct pci_bus *bus;
+
+	if (dev->subordinate) {
+		bus = dev->subordinate;
+		pci_walk_bus(bus, cb, userdata);
+	} else {
+		cb(dev, userdata);
+	}
+}
+EXPORT_SYMBOL_GPL(pci_walk_below_dev);
+
 struct pci_bus *pci_bus_get(struct pci_bus *bus)
 {
 	if (bus)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..fa08b1cc3d96 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -151,33 +151,39 @@ pci_ers_result_t
pcie_do_recovery(struct
pci_dev
*dev,
 			pci_ers_result_t (*reset_link)(struct
pci_dev *pdev))
 {
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
-	struct pci_bus *bus;

 	/*
 	 * Error recovery runs on all subordinates of the
first
downstream
port.
 	 * If the downstream port detected the error, it is
cleared at
the
end.
+	 * For RCiEPs we should reset just the RCiEP itself.
 	 */
 	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
-	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
+	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
+	      pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END))
 		dev = dev->bus->self;
-	bus = dev->subordinate;

 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
-		pci_walk_bus(bus, report_frozen_detected,
&status);
+		pci_walk_below_dev(dev, report_frozen_detected,
&status);
+		if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END)
{
+			pci_warn(dev, "link reset not possible
for RCiEP\n");
+			status = PCI_ERS_RESULT_NONE;
+			goto failed;
+		}
+
 		status = reset_link(dev);
 		if (status != PCI_ERS_RESULT_RECOVERED) {
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
 		}
 	} else {
-		pci_walk_bus(bus, report_normal_detected,
&status);
+		pci_walk_below_dev(dev, report_normal_detected,
&status);
 	}

 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
 		status = PCI_ERS_RESULT_RECOVERED;
 		pci_dbg(dev, "broadcast mmio_enabled
message\n");
-		pci_walk_bus(bus, report_mmio_enabled,
&status);
+		pci_walk_below_dev(dev, report_mmio_enabled,
&status);
 	}

 	if (status == PCI_ERS_RESULT_NEED_RESET) {
@@ -188,17 +194,20 @@ pci_ers_result_t
pcie_do_recovery(struct
pci_dev
*dev,
 		 */
 		status = PCI_ERS_RESULT_RECOVERED;
 		pci_dbg(dev, "broadcast slot_reset message\n");
-		pci_walk_bus(bus, report_slot_reset, &status);
+		pci_walk_below_dev(dev, report_slot_reset,
&status);
 	}

 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;

 	pci_dbg(dev, "broadcast resume message\n");
-	pci_walk_bus(bus, report_resume, &status);
+	pci_walk_below_dev(dev, report_resume, &status);

-	pci_aer_clear_device_status(dev);
-	pci_aer_clear_nonfatal_status(dev);
+	if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+	     pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+		pci_aer_clear_device_status(dev);
+		pci_aer_clear_nonfatal_status(dev);
+	}
 	pci_info(dev, "device recovery successful\n");
 	return status;

diff --git a/include/linux/pci.h b/include/linux/pci.h
index c79d83304e52..538bf0a76d33 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1411,6 +1411,8 @@ int pci_scan_bridge(struct pci_bus
*bus,
struct
pci_dev *dev, int max,

 void pci_walk_bus(struct pci_bus *top, int (*cb)(struct
pci_dev
*,
void *),
 		  void *userdata);
+void pci_walk_below_dev(struct pci_dev *dev, int
(*cb)(struct
pci_dev
*, void *),
+			void *userdata);
 int pci_cfg_space_size(struct pci_dev *dev);
 unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
--
2.19.1