Terry Bowman wrote:
> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> potential corruption on what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements
>
> Add a new function, cxl_do_recovery() using the following.
>
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.
>
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
>
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
>
> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> ---

[snip]

> +
> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> +{
> +	const struct cxl_error_handlers *cxl_err_handler;
> +	struct pci_driver *pdrv = dev->driver;
> +	bool *status = data;
> +
> +	device_lock(&dev->dev);
> +	if (pdrv && pdrv->cxl_err_handler &&
> +	    pdrv->cxl_err_handler->error_detected) {
> +		cxl_err_handler = pdrv->cxl_err_handler;
> +		*status = cxl_err_handler->error_detected(dev);
> +	}
> +	device_unlock(&dev->dev);
> +	return *status;

This is probably just another nit on my part but returning bool here for int
may cause issues down the road. Looking at this I wonder if it would be
better to add *_PANIC to pci_ers_result_t and return that similar to
report_error_detected()?

> +}
> +
> +void cxl_do_recovery(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +	int type = pci_pcie_type(dev);
> +	struct pci_dev *bridge;
> +	int status;
> +
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM ||
> +	    type == PCI_EXP_TYPE_UPSTREAM ||
> +	    type == PCI_EXP_TYPE_ENDPOINT)
> +		bridge = dev;
> +	else
> +		bridge = pci_upstream_bridge(dev);
> +
> +	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
> +	if (status)
> +		panic("CXL cachemem error.");
> +
> +	if (host->native_aer || pcie_ports_native) {
> +		pcie_clear_device_status(dev);
> +		pci_aer_clear_nonfatal_status(dev);
> +	}

There is a nice informative comment in pcie_do_recovery() about this block. I
think we should combine this and that block into a new function which
preserves that for both paths.

Ira

> +
> +	pci_info(bridge, "CXL uncorrectable error.\n");
> +}
> --
> 2.34.1
>
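
For illustration only, the suggestion above (return a dedicated result code
instead of a bool squeezed through an int) could look roughly like the
userspace mock below. Nothing here is kernel code: every mock_* name is
hypothetical, MOCK_ERS_RESULT_PANIC merely stands in for the proposed *_PANIC
value, and locking is left out. The walk callback's int return is used only to
stop the walk, while the error state travels through a typed result that the
caller initializes before walking.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock: all names are hypothetical, none exist in the kernel. */
enum mock_ers_result {
	MOCK_ERS_RESULT_NONE,	/* no uncorrectable error reported */
	MOCK_ERS_RESULT_PANIC,	/* stand-in for the proposed *_PANIC value */
};

struct mock_dev {
	const char *name;
	/* Driver handler; NULL models a device with no handler bound. */
	enum mock_ers_result (*error_detected)(struct mock_dev *dev);
};

/*
 * Walk callback. The int return value only controls the walk
 * (nonzero stops it); the error state is carried in *data.
 */
int mock_report_error_detected(struct mock_dev *dev, void *data)
{
	enum mock_ers_result *result = data;

	if (dev->error_detected &&
	    dev->error_detected(dev) == MOCK_ERS_RESULT_PANIC)
		*result = MOCK_ERS_RESULT_PANIC;

	return *result == MOCK_ERS_RESULT_PANIC;
}

/* Stand-in for the bridge walk: visit devices in order until cb says stop. */
void mock_walk(struct mock_dev *devs, size_t n,
	       int (*cb)(struct mock_dev *, void *), void *data)
{
	assert(cb != NULL);
	for (size_t i = 0; i < n; i++)
		if (cb(&devs[i], data))
			break;
}

/* Example handlers: one reports nothing, one reports a UCE. */
enum mock_ers_result mock_handler_clean(struct mock_dev *dev)
{
	(void)dev;
	return MOCK_ERS_RESULT_NONE;
}

enum mock_ers_result mock_handler_uce(struct mock_dev *dev)
{
	(void)dev;
	return MOCK_ERS_RESULT_PANIC;
}

/* Returns the merged result; the caller decides whether to panic. */
enum mock_ers_result mock_do_recovery(struct mock_dev *devs, size_t n)
{
	/* Initialized before the walk so an empty walk yields NONE. */
	enum mock_ers_result result = MOCK_ERS_RESULT_NONE;

	mock_walk(devs, n, mock_report_error_detected, &result);
	return result;
}
```

In this mock a caller would compare the return value of mock_do_recovery()
against MOCK_ERS_RESULT_PANIC before deciding to panic, so the decision no
longer depends on how a bool is read back through an int.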