On 1/14/2025 5:33 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:43 -0600
> Terry Bowman <terry.bowman@xxxxxxx> wrote:
>
>> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
>> apply to CXL devices. Recovery can not be used for CXL devices because of
>> potential corruption on what can be system memory. Also, current PCIe UCE
>> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
>> does not begin at the RP/DSP but begins at the first downstream device.
>> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
>> CXL recovery is needed because of the different handling requirements
>>
>> Add a new function, cxl_do_recovery() using the following.
>>
>> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
>> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
>> will begin iteration at the RP or DSP rather than beginning at the
>> first downstream device.
> I'm still holding out for making pci_walk_bridge() do the same and seeing
> what if anything breaks.

I can test AER fatal UCE on a PCIe device. Do you have any other ideas for
specific testing? A specific device or topology in mind ?

Regards,
Terry

> Other than that I'm fine with this patch.
>
>> Add cxl_report_error_detected() as an analog to report_error_detected().
>> It will call pci_driver::cxl_err_handlers for each iterated downstream
>> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
>> indicating if there was a UCE error detected during handling.
>>
>> cxl_do_recovery() uses the status from cxl_report_error_detected() to
>> determine how to proceed. Non-fatal CXL UCE errors will be treated as
>> fatal. If a UCE was present during handling then cxl_do_recovery()
>> will kernel panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
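
To illustrate the traversal difference discussed above, here is a rough
user-space sketch. It is not the patch's code and does not use kernel APIs:
struct toy_dev and every toy_* name are hypothetical stand-ins, and the
PCI-style walk is only modeled loosely on the behavior described in the
commit message (start below the RP/DSP when it has downstream devices).
The "panic" decision is reduced to a boolean return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI/CXL device node; NOT struct pci_dev. */
struct toy_dev {
	const char *name;
	bool uce_pending;        /* assumption: a CXL UCE is logged here */
	struct toy_dev *child;   /* first device below this port */
	struct toy_dev *sibling; /* next device on the same bus */
};

/* Accumulated status, playing the role of the walk's user data. */
struct toy_result {
	bool uce_seen;
	int visited;
};

typedef bool (*toy_cb)(struct toy_dev *dev, void *data);

/* Depth-first visit of everything below dev; dev itself is not visited. */
static void toy_walk_below(struct toy_dev *dev, toy_cb cb, void *data)
{
	for (struct toy_dev *d = dev->child; d; d = d->sibling) {
		cb(d, data);
		toy_walk_below(d, cb, data);
	}
}

/* PCI-style walk: begins at the first downstream device when one exists. */
static void toy_pci_walk_bridge(struct toy_dev *bridge, toy_cb cb, void *data)
{
	assert(bridge);
	if (bridge->child)
		toy_walk_below(bridge, cb, data);
	else
		cb(bridge, data);
}

/* CXL-style walk: begins at the RP/DSP itself, then descends. */
static void toy_cxl_walk_bridge(struct toy_dev *bridge, toy_cb cb, void *data)
{
	assert(bridge);
	cb(bridge, data);
	toy_walk_below(bridge, cb, data);
}

/* Analog of a per-device UCE handler: returns true if a UCE was found. */
static bool toy_report_error_detected(struct toy_dev *dev, void *data)
{
	struct toy_result *res = data;

	res->visited++;
	if (dev->uce_pending)
		res->uce_seen = true;
	return dev->uce_pending;
}

/* Returns true where the recovery described above would panic. */
static bool toy_cxl_do_recovery(struct toy_dev *bridge, struct toy_result *res)
{
	res->uce_seen = false;
	res->visited = 0;
	toy_cxl_walk_bridge(bridge, toy_report_error_detected, res);
	return res->uce_seen;
}
```

With a root port that has a UCE pending and one clean endpoint below it,
toy_cxl_do_recovery() visits both devices and reports the error, while
toy_pci_walk_bridge() visits only the endpoint and never sees it. That is the
gap the commit message describes for CXL Protocol Errors in an RP or DSP.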