On Fri, 25 Oct 2024 16:02:56 -0500 Terry Bowman <terry.bowman@xxxxxxx> wrote: > The AER service driver doesn't currently handle CXL protocol errors > reported by CXL root ports, CXL upstream switch ports, and CXL downstream > switch ports. Consequently, RAS protocol errors from CXL PCIe port devices > are not properly logged or handled. > > These errors are reported to the OS via the root port's AER correctable > and uncorrectable internal error fields. While the AER driver supports > handling downstream port protocol errors in restricted CXL host (RCH) mode > also known as CXL1.1, it lacks the same functionality for CXL PCIe ports > operating in virtual hierarchy (VH) mode. > > To address this gap, update the AER driver to handle CXL PCIe port device > protocol correctable errors (CE). > > Make this update alongside the existing downstream port RCH error handling > logic, extending support to CXL PCIe ports in VH mode. > > is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel > config. Update is_internal_error()'s function declaration such that it is > always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled > or disabled. > > The uncorrectable error (UCE) handling will be added in a future patch. > > [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and > Upstream Switch Ports > > Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> This is a fiddly patch to read, but that's at least partly diff going crazy in a few places. Anyhow, I think it is fine but I would call out that this changes things so that the PCI error handlers are no longer called for CXL ports if it's an internal error. With a sentence on that: Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> I'm not 100% convinced the path of separate handlers is the way to go but we can always change things again if that doesn't work out. Jonathan > --- > drivers/pci/pcie/aer.c | 59 ++++++++++++++++++++++++++++-------------- > 1 file changed, 39 insertions(+), 20 deletions(-) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 53e9a11f6c0f..1d3e5b929661 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent, > return true; > } > > -#ifdef CONFIG_PCIEAER_CXL > +static bool is_internal_error(struct aer_err_info *info) > +{ > + if (info->severity == AER_CORRECTABLE) > + return info->status & PCI_ERR_COR_INTERNAL; > > + return info->status & PCI_ERR_UNC_INTN; > +} > + > +#ifdef CONFIG_PCIEAER_CXL Diff was having fun. Maybe put a blank line here? I think that's what has tripped it up. > /** > * pci_aer_unmask_internal_errors - unmask internal errors > * @dev: pointer to the pcie_dev data structure > @@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev) > return (pcie_ports_native || host->native_aer); > } > - > /** > @@ -1115,8 +1131,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > { > - cxl_handle_error(dev, info); > - pci_aer_handle_error(dev, info); > + if (is_internal_error(info) && handles_cxl_errors(dev)) > + cxl_handle_error(dev, info); > + else > + pci_aer_handle_error(dev, info); Whilst not calling this for the CXL cases probably makes sense and given new code needs to be the case to avoid a double clear I think, I would call that change out more explicitly in the patch description. > + > pci_dev_put(dev); > } >