Re: [PATCH v7 04/17] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type

"Bowman, Terry" <terry.bowman@xxxxxxx> · Wed, 12 Feb 2025 15:08:10 -0600

On 2/12/2025 1:57 PM, Dan Williams wrote:
> Bowman, Terry wrote:
> [..]
>>> Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>> Ok. I can add is_cxl to 'struct aer_err_info'. Shall I set it by reading the
>> alternate protocol link state?
> I am thinking no because dev->is_cxl at least indicates that a CXL link
> was up at some point, and racing CXL link down is not something the
> error core can reasonably mitigate.
>
> In the end I think that it should be something like:
>
>    info->is_cxl = dev->is_cxl && is_internal_error()
>
> ...on the expectation that a CXL device is unlikely to multiplex
> internal errors across CXL protocol error events and device-specific
> internal events. Even if a device *did* multiplex those I think it is
> reasonable for the kernel to treat a device-specific UCE the same as a
> CXL protocol UCE and panic the system.
Ok.

I found in using is_internal_error() (v5) a USP with fatal UCE will not have AER status
populated in aer_info structure, only the severity field is populated (see
aer_get_device_error_info()). The aer_info is not populated because concern reading
the USP's AER (config space) when the upstream link state is invalid. Calling
is_internal_error() in this case will return false because the uncorrectable internal error (UIE) bit is 0 and proceed to treat as a PCIe error.
How do you want to proceed to handle the UCE protocol error in this case?

Terry