Re: [PATCH v7 06/17] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver

"Bowman, Terry" <terry.bowman@xxxxxxx> · Fri, 14 Feb 2025 13:36:35 -0600



On 2/11/2025 6:24 PM, Dan Williams wrote:
> Terry Bowman wrote:
>> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
>> apply to CXL devices. Recovery can not be used for CXL devices because of
>> potential corruption on what can be system memory. Also, current PCIe UCE
>> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
>> does not begin at the RP/DSP but begins at the first downstream device.
>> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
>> CXL recovery is needed because of the different handling requirements
>>
>> Add a new function, cxl_do_recovery() using the following.
>>
>> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
>> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
>> will begin iteration at the RP or DSP rather than beginning at the
>> first downstream device.
>>
>> pci_walk_bridge() is candidate to possibly reuse cxl_walk_bridge() but
>> needs further investigation. This will be left for future improvement
>> to make the CXL and PCI handling paths more common.
>>
>> Add cxl_report_error_detected() as an analog to report_error_detected().
>> It will call pci_driver::cxl_err_handlers for each iterated downstream
>> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
>> indicating if there was a UCE error detected during handling.
>>
>> cxl_do_recovery() uses the status from cxl_report_error_detected() to
>> determine how to proceed. Non-fatal CXL UCE errors will be treated as
>> fatal. If a UCE was present during handling then cxl_do_recovery()
>> will kernel panic.
> For what this is:
>
> Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>
> ...and perhaps it addresses my concern on the prior patch that
> ->error_detected() is responsible for the safety of checking that in
> fact a CXL internal error / UCE was detected.
Thanks Dan