Dan,

thanks for the review, see comments inline.

On 17.04.23 18:01:41, Dan Williams wrote:
> Terry Bowman wrote:
> > From: Robert Richter <rrichter@xxxxxxx>
> >
> > In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> > RCiEP, but CXL downstream and upstream ports are not enumerated and
> > not visible in the PCIe hierarchy. Protocol and link errors are sent
> > to an RCEC.
> >
> > Restricted CXL host (RCH) downstream port-detected errors are signaled
> > as internal AER errors, either Uncorrectable Internal Error (UIE) or
> > Corrected Internal Errors (CIE). The error source is the id of the
> > RCEC. A CXL handler must then inspect the error status in various CXL
> > registers residing in the dport's component register space (CXL RAS
> > cap) or the dport's RCRB (AER ext cap). [1]
> >
> > Errors showing up in the RCEC's error handler must be handled and
> > connected to the CXL subsystem. Implement this by forwarding the error
> > to all CXL devices below the RCEC. Since the entire CXL device is
> > controlled only using PCIe Configuration Space of device 0, Function
> > 0, only pass it there [2]. These devices have the Memory Device class
> > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver
> > can implement the handler. In addition to errors directed to the CXL
> > endpoint device, the handler must also inspect the CXL downstream
> > port's CXL RAS and PCIe AER external capabilities that is connected to
> > the device.
> >
> > Since CXL downstream port errors are signaled using internal errors,
> > the handler requires those errors to be unmasked. This is subject of a
> > follow-on patch.
> >
> > The reason for choosing this implementation is that a CXL RCEC device
> > is bound to the AER port driver, but the driver does not allow it to
> > register a custom specific handler to support CXL. Connecting the RCEC
> > hard-wired with a CXL handler does not work, as the CXL subsystem
> > might not be present all the time.
> > The alternative to add an
> > implementation to the portdrv to allow the registration of a custom
> > RCEC error handler isn't worth doing it as CXL would be its only user.
> > Instead, just check for an CXL RCEC and pass it down to the connected
> > CXL device's error handler. With this approach the code can entirely
> > be implemented in the PCIe AER driver and is independent of the CXL
> > subsystem. The CXL driver only provides the handler.
> >
> > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> >
> > Co-developed-by: Terry Bowman <terry.bowman@xxxxxxx>
> > Signed-off-by: Robert Richter <rrichter@xxxxxxx>
> > Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> > Cc: "Oliver O'Halloran" <oohall@xxxxxxxxx>
> > Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> > Cc: Mahesh J Salgaonkar <mahesh@xxxxxxxxxxxxx>
> > Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
> > Cc: linux-pci@xxxxxxxxxxxxxxx
> > ---
> >  drivers/pci/pcie/Kconfig |  8 ++++++
> >  drivers/pci/pcie/aer.c   | 61 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 69 insertions(+)
> >
> > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> > index 228652a59f27..b0dbd864d3a3 100644
> > --- a/drivers/pci/pcie/Kconfig
> > +++ b/drivers/pci/pcie/Kconfig
> > @@ -49,6 +49,14 @@ config PCIEAER_INJECT
> >  	  gotten from:
> >  	     https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
> >
> > +config PCIEAER_CXL
> > +	bool "PCI Express CXL RAS support"
> > +	default y
> > +	depends on PCIEAER && CXL_PCI
> > +	help
> > +	  This enables CXL error handling for Restricted CXL Hosts
> > +	  (RCHs).
> > +
> >  #
> >  # PCI Express ECRC
> >  #
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 7a25b62d9e01..171a08fd8ebd 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent,
> >  	return true;
> >  }
> >
> > +#ifdef CONFIG_PCIEAER_CXL
> > +
> > +static bool is_cxl_mem_dev(struct pci_dev *dev)
> > +{
> > +	/*
> > +	 * A CXL device is controlled only using PCIe Configuration
> > +	 * Space of device 0, Function 0.
> > +	 */
> > +	if (dev->devfn != PCI_DEVFN(0, 0))
> > +		return false;
> > +
> > +	/* Right now there is only a CXL.mem driver */
> > +	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
> > +		return false;
> > +
> > +	return true;
> > +}
>
> This part feels broken because most of the errors of concern here are CXL
> link generic and that can involve CXL.cache and CXL.mem errors on
> devices that are not PCI_CLASS_MEMORY_CXL. This situation feels like it
> wants formal acknowledgement in 'struct pci_dev' that CXL links ride on
> top of PCIe links.

There is already rcec->rcec_ea that holds the RCEC-to-endpoint
association. Determining if the RCiEP is a CXL dev is a small check,
which is exactly what is_cxl_mem_dev() is for. I don't see a benefit
in holding the same information in an additional cxl_link structure.
And as you also said below, for RCRB handling a CXL driver is needed,
which is why is_cxl_mem_dev() with the class check is used below.

>
> If it were not for RCRBs then the PCI core could just do:
>
>     dvsec = pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
>                                       CXL_DVSEC_FLEXBUS_PORT);
>
> ...at bus scan time to identify devices with active CXL links. RCRBs
> unfortunately make it so the link presence can not be detected until a
> CXL driver is loaded to read that DVSEC out of MMIO space.

In a VH topology those errors can be handled directly in a PCI driver
for CXL ports; if the portdrv handles that, the check could be useful.
But this is not the subject of this patch series.

>
> However, I still think that looks like a CXL aware driver registering a
> 'struct cxl_link' (for lack of a better name) object with a
> corresponding PCI device. That link can indicate whether this is an RCH
> topology and whether it needs to do the RCEC walk, and that registration
> event can flag the RCEC as having CXL link duties to attend to on AER
> events.

For CXL awareness of the AER driver the simple checks from above could
be used, either called directly for the pci_dev (VH mode), or by
walking the RCEC. IMO, a 'struct cxl_link' and a function to register
it are not really needed here.

>
> I suspect 'struct cxl_link' can also be used if/when we get to
> incorporating CXL Reset into PCI reset handling.
>
> > +
> > +static bool is_internal_error(struct aer_err_info *info)
> > +{
> > +	if (info->severity == AER_CORRECTABLE)
> > +		return info->status & PCI_ERR_COR_INTERNAL;
> > +
> > +	return info->status & PCI_ERR_UNC_INTN;
> > +}
> > +
> > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info);
> > +
> > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data)
> > +{
> > +	struct aer_err_info *e_info = (struct aer_err_info *)data;
> > +
> > +	if (!is_cxl_mem_dev(dev))
> > +		return 0;
>
> I assume this also needs to reference the RDPAS if present?

That is subject of a follow-on patch. Here I see why you may need a
struct cxl_link. But that list must not reside in the pci_dev; instead
a CXL aware driver can look up a self-maintained list of RDPAS
mappings (RCEC-to-Downstream Port associations) to decide whether to
look up the dport's AER and RAS capabilities.

>
> CXL 3.0 9.17.1.5 RCEC Downstream Port Association Structure (RDPAS)
>
> > +
> > +	/* pci_dev_put() in handle_error_source() */
> > +	dev = pci_dev_get(dev);
> > +	if (dev)
> > +		handle_error_source(dev, e_info);
>
> I went looking but missed where does handle_error_source() synchronize
> against driver ->remove()?
Right, the device_lock() is missing in handle_error_source() while
accessing pdrv and calling the handler. Will send a fix.

>
> > +
> > +	return 0;
> > +}
> > +
> > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>
> Naming suggestion...
>
> Given that the VH topology does not require this scanning and
> association step, let's call this cxl_rch_handle_error() to make it clear
> this is only here to undo the awkwardness of CXL 1.1 platforms hiding
> registers from typical PCI scanning. A reference to:
>
>     CXL 3.0 9.11.8 CXL Devices Attached to an RCH
>
> ...might be useful to a future reader that wonders why the CXL RCH case
> is so complicated from an AER perspective.

Ok.

Thanks,

-Robert
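
For readers following the ->remove() synchronization point above, here
is a minimal userspace sketch of the locking pattern being discussed.
It is not kernel code and not the actual fix: a pthread mutex stands in
for device_lock(), and every name in it (model_device, model_driver,
model_unbind, model_handle_error, model_count_errors) is hypothetical.
The idea it illustrates is only that the driver lookup and the handler
call happen under the same lock that driver unbind takes.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's structs. */
struct model_driver {
	void (*error_detected)(int severity);
};

struct model_device {
	pthread_mutex_t lock;		/* plays the role of device_lock() */
	struct model_driver *driver;	/* NULL once the driver is unbound */
};

/* Models driver unbind: clears the driver pointer under the device lock. */
static void model_unbind(struct model_device *dev)
{
	assert(dev != NULL);
	pthread_mutex_lock(&dev->lock);
	dev->driver = NULL;
	pthread_mutex_unlock(&dev->lock);
}

/*
 * Models the error path: look up the driver and call its handler while
 * holding the same lock, so an unbind cannot run in between.
 * Returns 1 if a handler was called, 0 otherwise.
 */
static int model_handle_error(struct model_device *dev, int severity)
{
	int handled = 0;

	assert(dev != NULL);
	pthread_mutex_lock(&dev->lock);
	if (dev->driver && dev->driver->error_detected) {
		dev->driver->error_detected(severity);
		handled = 1;
	}
	pthread_mutex_unlock(&dev->lock);

	return handled;
}

/* Example handler that only counts how often it was invoked. */
static int model_calls;

static void model_count_errors(int severity)
{
	(void)severity;
	model_calls++;
}
```

With a driver bound, model_handle_error() calls the handler and returns
1; after model_unbind() it returns 0 and never dereferences the stale
driver pointer.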