On Wed, 13 Nov 2024 15:54:19 -0600 Terry Bowman <terry.bowman@xxxxxxx> wrote: > The AER service driver supports handling downstream port protocol errors in > restricted CXL host (RCH) mode also known as CXL1.1. It needs the same > functionality for CXL PCIe ports operating in virtual hierarchy (VH) > mode.[1] > > CXL and PCIe protocol error handling have different requirements that > necessitate a separate handling path. The AER service driver may try to > recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not > suitable for CXL PCIe port devices because of potential for system memory > corruption. Instead, CXL protocol error handling must use a kernel panic > in the case of a fatal or non-fatal UCE. The AER driver's PCIe error > handling does not panic the kernel in response to a UCE. > > Introduce a separate path for CXL protocol error handling in the AER > service driver. This will allow CXL protocol errors to use CXL specific > handling instead of PCIe handling. Add the CXL specific changes without > affecting or adding functionality in the PCIe handling. > > Make this update alongside the existing downstream port RCH error handling > logic, extending support to CXL PCIe ports in VH mode. > > is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel > config. Update is_internal_error()'s function declaration such that it is > always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled > or disabled. > > The uncorrectable error (UCE) handling will be added in a future patch. > > [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and > Upstream Switch Ports > > Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> > Reviewed-by: Dave Jiang <dave.jiang@xxxxxxxxx> I took another look and so a question inline. Jonathan > static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data) > { > struct aer_err_info *info = (struct aer_err_info *)data; > @@ -1033,14 +1032,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data) > > static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > { > - /* > - * Internal errors of an RCEC indicate an AER error in an > - * RCH's downstream port. Check and handle them in the CXL.mem > - * device driver. > - */ > - if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > - is_internal_error(info)) > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) > pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info); > + > + if (info->severity == AER_CORRECTABLE) { > + struct pci_driver *pdrv = dev->driver; > + int aer = dev->aer_cap; > + > + if (aer) How do we get here with no aer? On a PCIe device AER is optional, but not I think on a CXL device (I can't find the text but there is a change log entry that says to clarify that it is required for CXL devices) Maybe the optionality is why the PCIe code has this check. Anyhow, I don't really mind keeping it, was just curious. > + pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS, > + info->status); > + > + if (pdrv && pdrv->cxl_err_handler && > + pdrv->cxl_err_handler->cor_error_detected) > + pdrv->cxl_err_handler->cor_error_detected(dev); > + > + pcie_clear_device_status(dev); > + } > }