On Tue, 14 Jan 2025 14:28:13 -0600
"Bowman, Terry" <terry.bowman@xxxxxxx> wrote:

> On 1/14/2025 5:33 AM, Jonathan Cameron wrote:
> > On Tue, 7 Jan 2025 08:38:43 -0600
> > Terry Bowman <terry.bowman@xxxxxxx> wrote:
> >
> >> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> >> apply to CXL devices. Recovery can not be used for CXL devices because of
> >> potential corruption on what can be system memory. Also, current PCIe UCE
> >> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> >> does not begin at the RP/DSP but begins at the first downstream device.
> >> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> >> CXL recovery is needed because of the different handling requirements
> >>
> >> Add a new function, cxl_do_recovery() using the following.
> >>
> >> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> >> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> >> will begin iteration at the RP or DSP rather than beginning at the
> >> first downstream device.
> > I'm still holding out for making pci_walk_bridge() do the same and seeing
> > what if anything breaks.
>
> I can test AER fatal UCE on a PCIe device. Do you have any other ideas for specific
> testing? A specific device or topology in mind ?

It's the interaction with runtime power management usage that worries
me and might need wider testing. Maybe it is just a case of sending a
patch marked RFT. The other paths are no-op where it matters.

Jonathan

>
> Regards,
> Terry
>
> > Other than that I'm fine with this patch.
> >
> >> Add cxl_report_error_detected() as an analog to report_error_detected().
> >> It will call pci_driver::cxl_err_handlers for each iterated downstream
> >> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> >> indicating if there was a UCE error detected during handling.
> >>
> >> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> >> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> >> fatal. If a UCE was present during handling then cxl_do_recovery()
> >> will kernel panic.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>
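
For anyone following the walk-order discussion, here is a rough userspace
sketch of the difference the patch description draws between the two
walkers. It is a toy model only, not kernel code: struct toy_dev,
toy_walk_below(), toy_pci_walk_bridge() and toy_cxl_walk_bridge() are
hypothetical names made up for illustration, and the "PCIe-style" variant
simply assumes the behaviour stated above, namely that iteration starts at
the first downstream device when the port has one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_VISITS 16

/* Toy stand-in for a device in the topology; NOT the kernel's struct pci_dev. */
struct toy_dev {
	const char *name;
	struct toy_dev *children[MAX_CHILDREN];	/* devices below this port */
	size_t nr_children;
};

/* Records the order in which the callback saw each device. */
struct visit_log {
	const char *names[MAX_VISITS];
	size_t count;
};

typedef void (*toy_cb)(struct toy_dev *dev, void *data);

/* Example callback: append the visited device's name to the log. */
static void log_visit(struct toy_dev *dev, void *data)
{
	struct visit_log *log = data;

	assert(log->count < MAX_VISITS);
	log->names[log->count++] = dev->name;
}

/* Depth-first walk of everything below @dev, excluding @dev itself. */
static void toy_walk_below(struct toy_dev *dev, toy_cb cb, void *data)
{
	for (size_t i = 0; i < dev->nr_children; i++) {
		cb(dev->children[i], data);
		toy_walk_below(dev->children[i], cb, data);
	}
}

/*
 * Assumed PCIe-style order: a port with devices below it is skipped and
 * iteration begins at the first downstream device.
 */
static void toy_pci_walk_bridge(struct toy_dev *bridge, toy_cb cb, void *data)
{
	if (bridge->nr_children)
		toy_walk_below(bridge, cb, data);
	else
		cb(bridge, data);
}

/* CXL-style order: visit the RP/DSP itself first, then everything below. */
static void toy_cxl_walk_bridge(struct toy_dev *bridge, toy_cb cb, void *data)
{
	cb(bridge, data);
	toy_walk_below(bridge, cb, data);
}
```

With a root port that has a single endpoint below it, the PCIe-style walk
in this model calls the callback for the endpoint only, while the
CXL-style walk calls it for the root port and then the endpoint. That
extra visit is the one needed to catch a protocol error logged in the
RP/DSP itself.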